Let $Y$ be a Gaussian vector whose components are independent
with a common unknown variance. We consider the problem of estimating
the mean $\mu$ of $Y$ by model selection. More precisely, we start
with a collection $\mathcal{S} = \{S_m, m \in \mathcal{M}\}$ of linear subspaces of $\mathbb{R}^n$ and
associate to each of these the least-squares estimator of $\mu$ on $S_m$.
Then, we use a data driven penalized criterion in order to select one
estimator among these. Our first objective is to analyze the performance
of estimators associated to classical criteria such as FPE, AIC,
BIC and AMDL. Our second objective is to propose better penalties
that are versatile enough to take into account both the complexity
of the collection $\mathcal{S}$ and the sample size. Then we apply those to solve
various statistical problems such as variable selection, change point
detections and signal estimation among others. Our results are based
on a nonasymptotic risk bound with respect to the Euclidean loss for
the selected estimator. Some analogous results are also established
for the Kullback loss.

1. Introduction. Let us consider the statistical model

(1.1)   $Y_i = \mu_i + \sigma\varepsilon_i, \quad i = 1, \ldots, n,$

where the parameters $\mu = (\mu_1, \ldots, \mu_n)' \in \mathbb{R}^n$ and $\sigma > 0$ are both unknown
and the $\varepsilon_i$'s are i.i.d. standard Gaussian random variables. We want to
estimate $\mu$ by model selection on the basis of the observation of $Y = (Y_1, \ldots, Y_n)'$.

To do this, we introduce a collection $\mathcal{S} = \{S_m, m \in \mathcal{M}\}$ of linear subspaces
of $\mathbb{R}^n$, that hereafter will be called models, indexed by a finite or countable
set $\mathcal{M}$. To each $m \in \mathcal{M}$ we can associate the least-squares estimator
$\hat\mu_m = \Pi_m Y$ of $\mu$ relative to $S_m$, where $\Pi_m$ denotes the orthogonal projector onto
$S_m$. Let us denote by $D_m$ the dimension of $S_m$ for $m \in \mathcal{M}$ and by $\|\cdot\|$ the
Euclidean norm on $\mathbb{R}^n$. The quadratic risk $\mathbb{E}[\|\mu - \hat\mu_m\|^2]$ of $\hat\mu_m$ with respect
to this distance is given by

(1.2)   $\mathbb{E}[\|\mu - \hat\mu_m\|^2] = \inf_{s \in S_m} \|\mu - s\|^2 + D_m\sigma^2.$

If we use this risk as a quality criterion, a best model is one minimizing
the right-hand side of (1.2). Unfortunately, such a model is not available to
the statistician since it depends on the unknown parameters $\mu$ and $\sigma^2$. A
natural question then arises: to what extent can we select an element $\hat m(Y)$
of $\mathcal{M}$ depending on the data only, in such a way that the risk of the selected
estimator $\hat\mu_{\hat m}$ be close to the minimal risk

(1.3)   $R(\mu, \mathcal{S}) = \inf_{m \in \mathcal{M}} \mathbb{E}[\|\mu - \hat\mu_m\|^2]$?

The art of model selection is to design such a selection rule in the best
possible way. The standard way of solving the problem is to define $\hat m$ as the
minimizer over $\mathcal{M}$ of some empirical criterion of the form

(1.4)   $\mathrm{Crit}_L(m) = \|Y - \Pi_m Y\|^2 \left(1 + \frac{\mathrm{pen}(m)}{n - D_m}\right)$

or

(1.5)   $\mathrm{Crit}_K(m) = \frac{n}{2} \log\left(\frac{\|Y - \Pi_m Y\|^2}{n}\right) + \frac{1}{2}\,\mathrm{pen}'(m),$

where pen and pen' denote suitable (penalty) functions mapping $\mathcal{M}$ into
$\mathbb{R}_+$. Note that these two criteria are equivalent (they select the same model)
if pen and pen' are related in the following way:

$\mathrm{pen}'(m) = n \log\left(1 + \frac{\mathrm{pen}(m)}{n - D_m}\right)$   or   $\mathrm{pen}(m) = (n - D_m)\left(e^{\mathrm{pen}'(m)/n} - 1\right).$
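
The equivalence of the two criteria can be checked numerically. The following is a minimal sketch, assuming criteria (1.4) and (1.5) are read as $\|Y - \Pi_m Y\|^2 (1 + \mathrm{pen}(m)/(n - D_m))$ and $(n/2)\log(\|Y - \Pi_m Y\|^2/n) + \mathrm{pen}'(m)/2$; the function names and the toy residual sums of squares are hypothetical and only serve the illustration.

```python
import math

def crit_L(rss, n, D, pen):
    """Criterion (1.4): ||Y - Pi_m Y||^2 * (1 + pen(m) / (n - D_m))."""
    return rss * (1.0 + pen / (n - D))

def crit_K(rss, n, pen_prime):
    """Criterion (1.5): (n/2) * log(||Y - Pi_m Y||^2 / n) + pen'(m) / 2."""
    return 0.5 * n * math.log(rss / n) + 0.5 * pen_prime

def pen_to_pen_prime(pen, n, D):
    """pen'(m) = n * log(1 + pen(m) / (n - D_m))."""
    return n * math.log1p(pen / (n - D))

def pen_prime_to_pen(pen_prime, n, D):
    """pen(m) = (n - D_m) * (exp(pen'(m) / n) - 1)."""
    return (n - D) * math.expm1(pen_prime / n)

# Hypothetical toy input: (dimension D_m, residual sum of squares) per model.
n = 50
models = [(1, 80.0), (3, 41.0), (5, 39.5), (10, 36.0)]

# pen'(m) as a function of D_m for criteria stated in the pen' form.
classical_pen_prime = {
    "AIC": lambda D: 2.0 * D,
    "BIC": lambda D: D * math.log(n),
    "AMDL": lambda D: 3.0 * D * math.log(n),
}

# For each criterion, select once with (1.5) and once with (1.4).
selected = {}
for name, pp in classical_pen_prime.items():
    by_K = min(models, key=lambda m: crit_K(m[1], n, pp(m[0])))
    by_L = min(
        models,
        key=lambda m: crit_L(m[1], n, m[0], pen_prime_to_pen(pp(m[0]), n, m[0])),
    )
    selected[name] = (by_K[0], by_L[0])
```

Both selections coincide because, under the stated relation, $\mathrm{Crit}_K(m) = (n/2)\log(\mathrm{Crit}_L(m)/n)$, an increasing function of $\mathrm{Crit}_L(m)$.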

The present paper is devoted to investigating the performance of criterion 
(1.4) or (1.5) as a function of collection S and pen or pen'. More precisely, 
we want to deal with the following problems: 

(P1) Given some collection $\mathcal{S}$ and an arbitrary nonnegative penalty function
pen on $\mathcal{M}$, what will the performance $\mathbb{E}[\|\mu - \hat\mu_{\hat m}\|^2]$ of $\hat\mu_{\hat m}$ be?

(P2) What conditions on $\mathcal{S}$ and pen ensure that the ratio $\mathbb{E}[\|\mu - \hat\mu_{\hat m}\|^2]/R(\mu, \mathcal{S})$
is not too large?

(P3) Given a collection $\mathcal{S}$, what penalty should be recommended in view of
minimizing (at least approximately) the risk of $\hat\mu_{\hat m}$?

It is beyond the scope of this paper to make an exhaustive historical 
review of the criteria of the form (1.4) and (1.5). We simply refer the
interested reader to the first chapters of McQuarrie and Tsai (1998) for a nice
and complete introduction to the domain. Let us only mention here some of
the most popular criteria, namely FPE, AIC, BIC (or SIC) and AMDL
which correspond respectively to the choices $\mathrm{pen}(m) = 2D_m$, $\mathrm{pen}'(m) = 2D_m$,
$\mathrm{pen}'(m) = D_m \log(n)$ and $\mathrm{pen}'(m) = 3D_m \log(n)$. FPE was introduced
in Akaike (1969) and is based on an unbiased estimate of the mean squared 
prediction error. AIC was proposed later by Akaike (1973) as a Kullback- 
Leibler information based model selection criterion. BIC and SIC are equiva- 
lent criteria which were respectively proposed by Schwarz (1978) and Akaike 
(1978) from a Bayesian perspective. More recently, Saito (1994) introduced 
AMDL as an information-theoretic based criterion. AMDL turns out to be a 
modified version of the Minimum Description Length criterion proposed by 
Rissanen (1983, 1984). The motivations for the construction of FPE, AIC, 
SIC and BIC criteria are a mixture of heuristic and asymptotic arguments. 
From both the theoretical and the practical point of view, these penalties 
suffer from the same drawback: their performance heavily depends on the 
sample size and the collection S at hand. 

In recent years, more attention has been paid to the nonasymptotic point 
of view and a proper calibration of penalties taking into account the com- 
plexity (in a suitable sense) of the collection S. A pioneering work based on 
the methodology of minimum complexity and dealing with discrete models 
and various stochastic frameworks including regression appeared in Barron 
and Cover (1991) and Barron (1991). It was then extended to various types
of continuous models in Barron, Birgé and Massart (1999) and Birgé and
Massart (1997, 2001a, 2007). Within the Gaussian regression framework,
Birgé and Massart (2001a, 2007) consider model selection criteria of the
form

(1.6)   $\mathrm{crit}(m) = \|Y - \hat\mu_m\|^2 + \mathrm{pen}(m)\sigma^2$

and propose new penalty structures which depend on the complexity of the
collection $\mathcal{S}$. These penalties can be viewed as generalizing Mallows' $C_p$
[heuristically introduced in Mallows (1973)] which corresponds to the choice
$\mathrm{pen}(m) = 2D_m$ in (1.6). However, Birgé and Massart only deal with the
favorable situation where the variance $\sigma^2$ is known, although they provide
some hints to estimate it in Birgé and Massart (2007).

Unlike Birgé and Massart, we consider here the more practical case where
$\sigma^2$ is unknown. Yet our approach is similar in the sense that our objective
is to propose new penalty structures for criteria (1.4) [or (1.5)] which allow
us to take both the complexity of the collection and the sample size into
account.

A possible application of the criteria we propose is variable selection in 
linear models. This problem has received a lot of attention in the literature. 
Recent developments include Tibshirani (1996) with the LASSO, Efron et
al. (2004) with LARS, Candès and Tao (2007) for the Dantzig selector, and
Zou (2006) with the Adaptive LASSO, among others. Most of the recent
literature assumes that $\sigma^2$ is known, or suitably estimated, and aims at
designing an algorithm that solves the problem in polynomial time at the price
of assumptions on the covariates to select. In contrast, our approach assumes
nothing on $\sigma^2$ or the covariates, but requires that the number of these is not
too large for a practical implementation.

The paper is organized as follows. In Section 2 we start with some ex- 
amples of model selection problems among which variable selection, change 
point detection and denoising. This section gives the opportunity to both 
motivate our approach and make a review of some collections of models of 
interest. We address problem (P2) in Section 3 and analyze there FPE, AIC, 
BIC and AMDL criteria more specifically. In Section 4 we address problems 
(P1) and (P3) and introduce new penalty functions. In Section 5 we show
how the statistician can take advantage of the flexibility of these new
penalties to solve the model selection problems given in Section 2. Section 6 is
devoted to two simulation studies that assess the performance of our
estimator. In the first one we consider the problem of detecting the nonzero
components in the mean of a Gaussian vector and compare our estimator
with BIC, AIC and AMDL. In the second study, we consider the variable
selection problem and compare our procedure with the adaptive Lasso
proposed by Zou (2006). In Section 7 we provide an analogue of our main result
replacing the $L^2$-loss by the Kullback loss. The remaining sections are
devoted to the proofs.

To conclude this section, let us introduce some notation to be used
throughout the paper. For each $m \in \mathcal{M}$, $D_m$ denotes the dimension of $S_m$, $N_m$ the
quantity $n - D_m$ and $\mu_m = \Pi_m\mu$. We denote by $P_{\mu,\sigma^2}$ the distribution of $Y$.
We endow $\mathbb{R}^n$ with the Euclidean inner product denoted $\langle\cdot,\cdot\rangle$. For all $x \in \mathbb{R}$,
$(x)_+$ and $\lfloor x \rfloor$ denote respectively the positive and integer parts of $x$, and for
$y \in \mathbb{R}$, $x \wedge y = \min\{x, y\}$ and $x \vee y = \max\{x, y\}$. Finally, we write $\mathbb{N}^*$ for the
set of positive integers and $|m|$ for the cardinality of a set $m$.

2. Some examples of model selection problems. In order to illustrate 
and motivate the model selection approach to estimation, let us consider 
some examples of applications of practical interest. For each example, we 
shall describe the statistical problem at hand and the collection of models of 
interest. These collections will be characterized by a complexity index which 
is defined as follows. 

Definition 1. Let $M$ and $a$ be two nonnegative numbers. We say that
a collection $\mathcal{S}$ of linear spaces $\{S_m, m \in \mathcal{M}\}$ has a finite complexity index
$(M, a)$ if

$|\{m \in \mathcal{M}, D_m = D\}| \le M e^{aD}$   for all $D \ge 1$.

Let us note here that not all countable families of models have a finite
complexity index.

2.1. Detecting nonzero mean components. The problem at hand is to
recover the nonzero entries of a sparse high-dimensional vector $\mu$ observed
with additional Gaussian noise. We assume that the vector $\mu$ in (1.1) has at
most $p \le n - 2$ nonzero mean components but we do not know which
components are null. Our goal is to find $m^* = \{i \in \{1, \ldots, n\} \mid \mu_i \ne 0\}$ and estimate
$\mu$. Typically, $|m^*|$ is small as compared to the number of observations $n$.
This problem has received a lot of attention in recent years and various
solutions have been proposed. Most of them rely on thresholding methods
which require a suitable estimator of $\sigma^2$. We refer the interested reader to
Abramovitch et al. (2006) and the references therein. Closer to our approach
is the paper by Huet (2006) which is based on a penalized criterion related
to AIC.

To handle this problem, we consider the set $\mathcal{M}$ of all subsets of $\{1, \ldots, n\}$
with cardinality not larger than $p$. For each $m \in \mathcal{M}$, we take for $S_m$ the linear
space of those vectors $s$ in $\mathbb{R}^n$ such that $s_i = 0$ for $i \notin m$. By convention,
$S_\varnothing = \{0\}$. Since the number of models with dimension $D$ is $\binom{n}{D} \le n^D$, a
complexity index for this collection is $(M, a) = (1, \log n)$.
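
Definition 1 can be checked directly on this collection. The following is a minimal sketch; the helper name `has_complexity_index` and the values of `n` and `p` are hypothetical. The comparison is done on the logarithmic scale to avoid floating-point overflow and rounding at the boundary case $D = 1$.

```python
import math

def has_complexity_index(counts, M, a, tol=1e-12):
    """Check |{m : D_m = D}| <= M * exp(a * D) for every D >= 1.

    counts maps a dimension D to the number of models of that dimension.
    """
    return all(
        math.log(c) <= math.log(M) + a * D + tol
        for D, c in counts.items()
        if D >= 1 and c > 0
    )

# Collection of Section 2.1: all subsets of {1, ..., n} of size at most p.
n, p = 30, 10
counts = {D: math.comb(n, D) for D in range(1, p + 1)}

ok_log_n = has_complexity_index(counts, 1.0, math.log(n))  # (M, a) = (1, log n)
ok_zero = has_complexity_index(counts, 1.0, 0.0)           # (M, a) = (1, 0)
```

With these values, $(1, \log n)$ is a valid complexity index while $(1, 0)$ is not, since there are $n > 1$ models of dimension 1.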

2.2. Variable selection. Given a set of explanatory variables $x^{(1)}, \ldots, x^{(N)}$
and a response variable $y$ observed with additional Gaussian noise, we want
to find a small subset of the explanatory variables that adequately explains
$y$. This means that we observe $(Y_i, x_i^{(1)}, \ldots, x_i^{(N)})$ for $i = 1, \ldots, n$, where $x_i^{(j)}$
corresponds to the observation of the value of the variable $x^{(j)}$ in experiment
number $i$, $Y_i$ is given by (1.1) and $\mu_i$ can be written as

$\mu_i = \sum_{j=1}^{N} a_j x_i^{(j)},$

where the $a_j$'s are unknown real numbers. Since we do not exclude the
practical case where the number $N$ of explanatory variables is larger than
the number $n$ of observations, this representation is not necessarily unique.
We look for a subset $m$ of $\{1, \ldots, N\}$ such that the least-squares estimator
$\hat\mu_m$ of $\mu$ based on the linear span $S_m$ of the vectors $x^{(j)} = (x_1^{(j)}, \ldots, x_n^{(j)})'$,
$j \in m$, is as accurate as possible, restricting ourselves to sets $m$ of cardinality
bounded by $p \le n - 2$. By convention $S_\varnothing = \{0\}$.

A nonasymptotic treatment of this problem has been given by Birgé and
Massart (2001a), Candès and Tao (2007) and Zou (2006) when $\sigma^2$ is known.
To our knowledge, the practical case of an unknown value of $\sigma^2$ has not
been analyzed from a nonasymptotic point of view. Note that when $N \ge n$
the traditional residual least-squares estimator cannot be used to estimate
$\sigma^2$. Depending on our prior knowledge on the relative importance of the
explanatory variables, we distinguish between two situations.

2.2.1. A collection for "the ordered variable selection problem." We
consider here the favorable situation where the set of explanatory variables
$x^{(1)}, \ldots, x^{(p)}$ is ordered according to decreasing importance up to rank $p$ and
introduce the collection

$\mathcal{M}_o = \{\{1, \ldots, d\}, 1 \le d \le p\} \cup \{\varnothing\}$

of subsets of $\{1, \ldots, N\}$. Since the collection contains at most one model per
dimension, the family of models $\{S_m, m \in \mathcal{M}_o\}$ has a complexity index
$(M, a) = (1, 0)$.

2.2.2. A collection for "the complete variable selection problem." If we do 
not have much information about the relative importance of the explanatory 
variables $x^{(j)}$, it is more natural to choose for $\mathcal{M}$ the set of all subsets of
$\{1, \ldots, N\}$ of cardinality not larger than $p$. For a given $D \ge 1$, the number
of models with dimension $D$ is at most $\binom{N}{D} \le N^D$ so that $(M, a) = (1, \log N)$
is a complexity index for the collection $\{S_m, m \in \mathcal{M}\}$.

2.3. Change-points detection. We consider the functional regression
framework

$Y_i = f(x_i) + \sigma\varepsilon_i, \quad i = 1, \ldots, n,$

where $\{x_1 = 0, \ldots, x_n\}$ is an increasing sequence of deterministic points of
$[0, 1)$ and $f$ an unknown real valued function on $[0, 1)$. This leads to a
particular instance of (1.1) with $\mu_i = f(x_i)$ for $i = 1, \ldots, n$. In such a situation, the
loss function $\|\hat\mu - \mu\|^2 = \sum_{i=1}^{n} (\hat f(x_i) - f(x_i))^2$ is the discrete norm associated
to the design $\{x_1, \ldots, x_n\}$.

We assume here that the unknown $f$ is either piecewise constant or
piecewise linear with a number of change-points bounded by $p$. Our aim is to
design an estimator $\hat f$ which allows us to estimate the number, locations and
magnitudes of the jumps of either $f$ or $f'$, if any. The estimation of
change-points of a function $f$ has been addressed by Lebarbier (2005) who proposed
a model selection procedure related to Mallows' $C_p$.

2.3.1. Models for detecting and estimating the jumps of $f$. Since our loss
function only involves the values of $f$ at the design points, natural models
are those induced by piecewise constant functions with change-points among
$\{x_2, \ldots, x_n\}$. A potential set $m$ of $q$ change-points is a subset $\{t_1, \ldots, t_q\}$ of
$\{x_2, \ldots, x_n\}$ with $t_1 < \cdots < t_q$, $q \in \{0, \ldots, p\}$ with $p \le n - 3$, the set being
empty when $q = 0$. To a set $m$ of change-points $\{t_1, \ldots, t_q\}$ we associate the
model

$S_m = \{(g(x_1), \ldots, g(x_n))', g \in \mathcal{F}_m\},$

where $\mathcal{F}_m$ is the space of piecewise constant functions of the form

$\sum_{j=0}^{q} a_j \mathbf{1}_{[t_j, t_{j+1})}$   with $(a_0, \ldots, a_q) \in \mathbb{R}^{q+1}$, $t_0 = x_1$ and $t_{q+1} = 1$,

so that the dimension of $S_m$ is $|m| + 1$. Then we take for $\mathcal{M}$ the set of
all subsets of $\{x_2, \ldots, x_n\}$ with cardinality bounded by $p$. For any $D$ with
$1 \le D \le p + 1$ the number of models with dimension $D$ is $\binom{n-1}{D-1} \le n^D$ so
that $(M, a) = (1, \log n)$ is a complexity index for this collection.

2.3.2. A collection of models for detecting and estimating the jumps of $f'$.
Let us now turn to models for piecewise linear functions $g$ on $[0, 1)$ with
$q + 1$ pieces so that $g'$ has at most $q \le p$ jumps. We assume $p \le n - 4$. We
denote by $C([0, 1))$ the set of continuous functions on $[0, 1)$ and set $t_0 = 0$
and $t_{q+1} = 1$, as before. Given two nonnegative integers $j$ and $q$ such that
$q < 2^j$, we set $\mathcal{D}_j = \{k 2^{-j}, k = 1, \ldots, 2^j - 1\}$ and define

$\mathcal{M}_{j,q} = \{\{t_1, \ldots, t_q\} \subset \mathcal{D}_j, t_1 < \cdots < t_q\}$

and

$\mathcal{M} = \bigcup_{j \ge 1} \bigcup_{q=0}^{(2^j - 1) \wedge p} \mathcal{M}_{j,q}.$

For each $m = \{t_1, \ldots, t_q\} \in \mathcal{M}$ (with $m = \varnothing$ if $q = 0$), we define $\mathcal{F}_m$ as the
space of splines of degree 1 with knots in $m$, that is,

$\mathcal{F}_m = \left\{ g \in C([0, 1)) : g(x) = \sum_{j=0}^{q} (\alpha_j + \beta_j x) \mathbf{1}_{[t_j, t_{j+1})}(x), \ (\alpha_j, \beta_j) \in \mathbb{R}^2 \right\},$

and the corresponding model

$S_m = \{(g(x_1), \ldots, g(x_n))', g \in \mathcal{F}_m\} \subset \mathbb{R}^n.$

Note that $2 \le \dim(S_m) \le \dim(\mathcal{F}_m) = |m| + 2$ because of the continuity
constraint. Besides, let us observe that $\mathcal{M}$ is countable and that the number
of models $S_m$ with a dimension $D$ in $\{1, \ldots, p + 2\}$ is infinite. This implies
that the collection has no (finite) complexity index.
2.4. Estimating an unknown signal. We consider the problem of
estimating a (possibly) anisotropic signal in $\mathbb{R}^d$ observed at discrete times with
additional noise. This means that we observe the vector $Y$ given by (1.1)
with

(2.1)   $\mu_i = f(x_i), \quad i = 1, \ldots, n,$

where $x_1, \ldots, x_n \in [0, 1)^d$ and $f$ is an unknown function mapping $[0, 1)^d$ into
$\mathbb{R}$. To estimate $f$ we use models of piecewise polynomial functions on
partitions of $[0, 1)^d$ into hyperrectangles. We consider the set of indices

$\mathcal{M} = \{(r, k_1, \ldots, k_d), r \in \mathbb{N}, k_1, \ldots, k_d \in \mathbb{N}^* \text{ with } (r + 1)^d k_1 \cdots k_d \le n - 2\}.$

For $m = (r, k_1, \ldots, k_d) \in \mathcal{M}$, we set $\mathcal{J}_m = \prod_{i=1}^{d} \{1, \ldots, k_i\}$ and denote by $\mathcal{F}_m$
the space of piecewise polynomials $P$ such that the restriction of $P$ to each
hyperrectangle $\prod_{i=1}^{d} [(j_i - 1)k_i^{-1}, j_i k_i^{-1})$ with $j \in \mathcal{J}_m$ is a polynomial in $d$
variables of degree not larger than $r$. Finally, we consider the collection of
models

$S_m = \{(P(x_1), \ldots, P(x_n))', P \in \mathcal{F}_m\}, \quad m \in \mathcal{M}.$

Note that when $m = (r, k_1, \ldots, k_d)$, the dimension of $S_m$ is not larger than
$(r + 1)^d k_1 \cdots k_d$. A similar collection of models was introduced in Barron,
Birgé and Massart (1999) for the purpose of estimating a density on $[0, 1)^d$
under some Hölderian assumptions.

3. Analyzing penalized criteria with regard to family complexity.
Throughout the section, we set $\phi(x) = (x - 1 - \log(x))/2$ for $x \ge 1$ and denote by
$\phi^{-1}$ the reciprocal of $\phi$. We assume that the collection of models satisfies
for some $K > 1$ and $(M, a) \in \mathbb{R}_+^2$ the following assumption.

Assumption $(H_{K,M,a})$. The collection of models $\mathcal{S} = \{S_m, m \in \mathcal{M}\}$ has
a complexity index $(M, a)$ and satisfies

$\forall m \in \mathcal{M}, \quad D_m \le D_{\max} = \lfloor (n - \gamma_1)_+ \rfloor \wedge \lfloor ((n + 2)\gamma_2 - 1)_+ \rfloor,$

where

$\gamma_1 = (2t_{a,K}) \vee \frac{t_{a,K} + 1}{t_{a,K} - 1}, \qquad \gamma_2 = \frac{2\phi(K)}{(t_{a,K} - 1)^2}$

and

$t_{a,K} = K\phi^{-1}(a) > 1.$

If $a = 0$ and $a = \log(n)$, Assumption $(H_{K,M,a})$ amounts to assuming $D_m \le \delta(K)n$
and $D_m \le \delta(K)n/\log^2(n)$, respectively, for all $m \in \mathcal{M}$, where
$\delta(K) < 1$ is some constant depending on $K$ only. In any case, since $\gamma_2 \le 2\phi(K)(K - 1)^{-2} \le 1/2$,
Assumption $(H_{K,M,a})$ implies that $D_{\max} \le n/2$.
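
To make the dimension constraint concrete, the following minimal sketch evaluates $\phi$, its reciprocal and $D_{\max}$. It assumes the reading $\gamma_1 = (2t) \vee (t+1)/(t-1)$, $\gamma_2 = 2\phi(K)/(t-1)^2$ with $t = K\phi^{-1}(a)$, which reproduces the constants 6 and 0.39 quoted for FPE in Section 3.2; the function names and the bisection tolerance are hypothetical.

```python
import math

def phi(x):
    """phi(x) = (x - 1 - log x) / 2, increasing on [1, +inf)."""
    return (x - 1.0 - math.log(x)) / 2.0

def phi_inv(a, tol=1e-12):
    """Reciprocal of phi on [1, +inf), computed by bisection."""
    lo, hi = 1.0, 2.0
    while phi(hi) < a:          # enlarge the bracket until phi(hi) >= a
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if phi(mid) < a:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def d_max(n, K, a):
    """Return (D_max, gamma_1, gamma_2) under the assumed form of (H_{K,M,a})."""
    t = K * phi_inv(a)
    if t <= 1.0:
        raise ValueError("the assumption requires t = K * phi_inv(a) > 1")
    gamma1 = max(2.0 * t, (t + 1.0) / (t - 1.0))
    gamma2 = 2.0 * phi(K) / (t - 1.0) ** 2
    bound1 = math.floor(max(n - gamma1, 0.0))
    bound2 = math.floor(max((n + 2) * gamma2 - 1.0, 0.0))
    return min(bound1, bound2), gamma1, gamma2

# Case a = 0 with K = sqrt(2), the setting used for FPE in Section 3.2.
D, gamma1, gamma2 = d_max(100, math.sqrt(2.0), 0.0)
```

For $n = 100$ the second bound is the binding one, so the admissible dimensions are well below $n/2$, in line with the remark above.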



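3.1. Bounding the risk of $\hat\mu_{\hat m}$ under penalty constraints. The following
holds.

Theorem 1. Let $K > 1$ and $(M, a) \in \mathbb{R}_+^2$. Assume that the collection
$\mathcal{S} = \{S_m, m \in \mathcal{M}\}$ satisfies $(H_{K,M,a})$. If $\hat m$ is selected as a minimizer of
$\mathrm{Crit}_L$ [defined by (1.4)] among $\mathcal{M}$ and if pen satisfies

(3.1)   $\mathrm{pen}(m) \ge K^2\phi^{-1}(a)D_m \quad \forall m \in \mathcal{M},$

then the estimator $\hat\mu_{\hat m}$ satisfies

(3.2)   $\mathbb{E}\left[\frac{\|\mu - \hat\mu_{\hat m}\|^2}{\sigma^2}\right] \le \frac{K}{K - 1} \inf_{m \in \mathcal{M}} \left[\frac{\|\mu - \mu_m\|^2}{\sigma^2}\left(1 + \frac{\mathrm{pen}(m)}{n - D_m}\right) + \mathrm{pen}(m) - D_m\right] + R,$

where

$R = \frac{K}{K - 1}\left[K^2\phi^{-1}(a) + 2K + \frac{8KMe^{-a}}{(e^{\phi(K)/2} - 1)^2}\right].$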



In particular, if $\mathrm{pen}(m) = K^2\phi^{-1}(a)D_m$ for all $m \in \mathcal{M}$,

(3.3)   $\mathbb{E}[\|\mu - \hat\mu_{\hat m}\|^2] \le C\phi^{-1}(a)[R(\mu, \mathcal{S}) \vee \sigma^2],$

where $C$ is a constant depending on $K$ and $M$ only and $R(\mu, \mathcal{S})$ is the quantity
defined at equation (1.3).

If we exclude the situation where $\{0\} \in \mathcal{S}$, one has $R(\mu, \mathcal{S}) \ge \sigma^2$. Then,
(3.3) shows that the choice $\mathrm{pen}(m) = K^2\phi^{-1}(a)D_m$ leads to a control of the
ratio $\mathbb{E}[\|\mu - \hat\mu_{\hat m}\|^2]/R(\mu, \mathcal{S})$ by the quantity $C\phi^{-1}(a)$ which only depends on
$K$ and the complexity index $(M, a)$. For a typical collection of models, $a$ is
either of order of a constant (independent of $n$) or of order of a $\log(n)$. In the
first case, the risk bound we get leads to an oracle-type inequality showing
that the resulting estimator achieves, up to a constant, the best trade-off
between the bias and the variance term. In the second case, $\phi^{-1}(a)$ is of order
of a $\log(n)$ and the risk of the estimator differs from $R(\mu, \mathcal{S})$ by a logarithmic
factor. For the problem described in Section 2.1, this extra logarithmic
factor is known to be unavoidable [see Donoho and Johnstone (1994), Theorem
3]. We shall see in Section 3.3 that the constraint (3.1) is sharp at least in
the typical situations where $a = 0$ and $a = \log(n)$.






Y. BARAUD, C. GIRAUD AND S. HUET 



3.2. Analysis of some classical penalties with regard to complexity. In
the sequel, we review classical penalties and analyze their
performance in the light of Theorem 1.

FPE and AIC. As already mentioned, FPE corresponds to the choice
$\mathrm{pen}(m) = 2D_m$. If the complexity index $a$ belongs to $[0, \phi(2))$ [$\phi(2) \approx 0.15$],
then this penalty satisfies (3.1) with $K = \sqrt{2/\phi^{-1}(a)} > 1$. If the complexity
index of the collection is $(M, a) = (1, 0)$, by assuming that

$D_{\max} \le \min\{n - 6, 0.39(n + 2) - 1\}$

we ensure that Assumption $(H_{K,M,a})$ holds and we deduce from Theorem 1
that (3.2) is satisfied with $K/(K - 1) \le 3.42$. For such collections, the use of
FPE leads thus to an oracle-type inequality. The AIC criterion corresponds
to the penalty $\mathrm{pen}(m) = N_m(e^{2D_m/n} - 1) \ge 2N_m D_m/n$ and has thus
similar properties provided that $N_m/n$ remains bounded from below by some
constant larger than 1/2.

AMDL and BIC. The AMDL criterion corresponds to the penalty

(3.4)   $\mathrm{pen}(m) = N_m(e^{3D_m\log(n)/n} - 1) \ge 3N_m n^{-1} D_m \log(n).$

This penalty can cope with the (complex) collection of models introduced
in Section 2.1 for the problem of detecting the nonzero mean components in
a Gaussian vector. In this case, the complexity index of the collection can
be taken as $(M, a) = (1, \log(n))$ and since $\phi^{-1}(a) \le 2\log(n)$, inequality (3.1)
holds with $K = \sqrt{2}$. As soon as for all $m \in \mathcal{M}$,

(3.5)   $D_m \le \min\left\{n - 5.7\log(n), \frac{0.06(n + 2)}{(3\log(n) - 1)^2} - 1\right\},$

Assumption $(H_{K,M,a})$ is fulfilled and $\hat\mu_{\hat m}$ then satisfies (3.2) with
$K/(K - 1) \le 3.42$. Actually, this result has an asymptotic flavor since (3.5) and
therefore $(H_{K,M,a})$ hold for very large values of $n$ only. From a more practical point
of view, we shall see in Section 6 that the AMDL penalty is too large and thus
favors small dimensional linear spaces too much. The BIC criterion
corresponds to the choice $\mathrm{pen}(m) = N_m(e^{D_m\log(n)/n} - 1)$ and one can check that
$\mathrm{pen}(m)$ stays smaller than $\phi^{-1}(\log(n))D_m$ when $n$ is large. Consequently,
Theorem 1 cannot justify the use of the BIC criterion for the collection
above. In fact, we shall see in the next section that BIC is inappropriate in
this case.

When the complexity parameter $a$ is independent of $n$, criteria AMDL
and BIC satisfy (3.1) for $n$ large enough. Nevertheless, the logarithmic factor
involved in these criteria has the drawback of overpenalizing large dimensional
linear spaces. One consequence is that the risk bound (3.2) differs from an
oracle inequality by a logarithmic factor.
3.3. Minimal penalties. The aim of this section is to show that the
constraint (3.1) on the size of the penalty is sharp. We shall restrict ourselves to
the cases where $a = 0$ and $a = \log(n)$. Similar results have been established
in Birgé and Massart (2007) for criteria of the form (1.6). The interested
reader can find the proofs of the following propositions in Baraud, Giraud
and Huet (2007).

3.3.1. Case $a = 0$. For collections with such a complexity index, we have
seen that the conditions of Theorem 1 are fulfilled as soon as $\mathrm{pen}(m) \ge CD_m$
for all $m$ and some universal constant $C > 1$. Besides, the choice of
penalties of the form $\mathrm{pen}(m) = CD_m$ for all $m$ leads to oracle inequalities.
The following proposition shows that the constraint $C > 1$ is necessary to
avoid the overfitting phenomenon.

Proposition 1. Let $\mathcal{S} = \{S_m, m \in \mathcal{M}\}$ be a collection of models with
complexity index $(1, 0)$. Assume that $\mathrm{pen}(\bar m) < D_{\bar m}$ for some $\bar m \in \mathcal{M}$ and
set $C = \mathrm{pen}(\bar m)/D_{\bar m}$. If $\mu = 0$, the index $\hat m$ which minimizes criterion (1.4)
satisfies



where $c$ and $c'$ are positive functions of $C$ only.

Explicit values of $c$ and $c'$ can be found in the proof.

3.3.2. Case $a = \log(n)$. We restrict ourselves to the collection described
in Section 2.1. We have already seen that the choice of penalties of the form
$\mathrm{pen}(m) = 2CD_m\log n$ for all $m$ with $C > 1$ leads to a nearly optimal
bias and variance trade-off [up to an unavoidable $\log(n)$ factor] in the risk
bounds. We shall now see that the constraint $C > 1$ is sharp.

Proposition 2. Let $C_0 \in ]0, 1[$. Consider the collection of linear spaces
$\mathcal{S} = \{S_m \mid m \in \mathcal{M}\}$ described in Section 2.1, and assume that $p \le (1 - C_0)n$
and $n > e^{2/C_0}$. Let pen be a penalty satisfying $\mathrm{pen}(m) \le 2C_0 D_m\log(n)$ for
all $m \in \mathcal{M}$. If $\mu = 0$, the cardinality of the subset $\hat m$ selected as a minimizer
of criterion (1.4) satisfies

$\mathbb{P}(|\hat m| > \lfloor (1 - C_0)D \rfloor) \ge 1 - 2\exp(-c \cdots),$

where $D = \lfloor c' n^{1 - C_0}/\log^{3/2}(n) \rfloor \wedge p$ and $c$, $c'$ are positive functions of $C_0$ (to
be explicitly given in the proof).
Proposition 2 shows that AIC and FPE should not be used for model
selection purposes with the collection of Section 2.1. Moreover, if $p\log(n)/n \le \kappa < \log(2)$,
then the BIC criterion satisfies

$\mathrm{pen}(m) = N_m(e^{D_m\log(n)/n} - 1) \le e^{\kappa} D_m\log(n) < 2D_m\log(n)$

and also appears inadequate to cope with the complexity of this collection.

4. From general risk bounds to new penalized criteria. Given an
arbitrary penalty pen, our aim is to establish a risk bound for the estimator $\hat\mu_{\hat m}$
obtained from the minimization of (1.4). The analysis of this bound will lead
us to propose new penalty structures that take into account the complexity
of the collection. Throughout this section we shall assume that $D_m \le n - 2$
for all $m \in \mathcal{M}$.

The main theorem of this section uses the function Dkhi defined below. 

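Definition 2. Let $D$, $N$ be two positive numbers and $X_D$, $X_N$ be two
independent $\chi^2$ random variables with degrees of freedom $D$ and $N$
respectively. For $x \ge 0$, we define

(4.1)   $\mathrm{Dkhi}[D, N, x] = \frac{1}{\mathbb{E}(X_D)} \mathbb{E}\left[\left(X_D - x\frac{X_N}{N}\right)_+\right].$

Note that for $D$ and $N$ fixed, $x \mapsto \mathrm{Dkhi}[D, N, x]$ is decreasing from $[0, +\infty)$
into $(0, 1]$ and satisfies $\mathrm{Dkhi}[D, N, 0] = 1$.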
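
The function Dkhi can be approximated by simulation. The following is a minimal Monte Carlo sketch, assuming the definition $\mathrm{Dkhi}[D, N, x] = \mathbb{E}[(X_D - xX_N/N)_+]/\mathbb{E}(X_D)$ with $\mathbb{E}(X_D) = D$; the function name, the number of draws and the seed are hypothetical choices, and the result carries Monte Carlo error.

```python
import random

def dkhi_mc(D, N, x, n_draws=20000, seed=0):
    """Monte Carlo estimate of Dkhi[D, N, x].

    A chi-square variable with k degrees of freedom is a Gamma variable
    with shape k/2 and scale 2. Using a fixed seed gives common random
    numbers across values of x, so the estimate is monotone in x.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x_d = rng.gammavariate(D / 2.0, 2.0)  # chi-square, D degrees of freedom
        x_n = rng.gammavariate(N / 2.0, 2.0)  # chi-square, N degrees of freedom
        total += max(x_d - x * x_n / N, 0.0)
    return total / (n_draws * D)              # divide by E(X_D) = D

values = [dkhi_mc(3, 10, x) for x in (0.0, 2.0, 10.0)]
```

The three values decrease with $x$, starting close to 1 at $x = 0$, which matches the properties stated above.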

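Theorem 2. Let $\mathcal{S} = \{S_m, m \in \mathcal{M}\}$ be some collection of models such
that $N_m \ge 2$ for all $m \in \mathcal{M}$. Let pen be an arbitrary penalty function
mapping $\mathcal{M}$ into $\mathbb{R}_+$. Assume that there exists an index $\hat m$ among $\mathcal{M}$ which
minimizes (1.4) with probability 1. Then, the estimator $\hat\mu_{\hat m}$ satisfies for all
constants $c \ge 0$ and $K > 1$,

(4.2)   $\mathbb{E}\left[\frac{\|\mu - \hat\mu_{\hat m}\|^2}{\sigma^2}\right] \le \frac{K}{K - 1} \inf_{m \in \mathcal{M}} \left[\frac{\|\mu - \mu_m\|^2}{\sigma^2}\left(1 + \frac{\mathrm{pen}(m)}{N_m}\right) + \mathrm{pen}(m) - D_m\right] + \Sigma,$

where

$\Sigma = \frac{Kc}{K - 1} + \frac{2K^2}{K - 1} \sum_{m \in \mathcal{M}} (D_m + 1)\,\mathrm{Dkhi}\left[D_m + 1, N_m - 1, \frac{(N_m - 1)(\mathrm{pen}(m) + c)}{KN_m}\right].$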
Note that a minimizer of $\mathrm{Crit}_L$ does not necessarily exist for an arbitrary
penalty function, unless $\mathcal{M}$ is finite. Take for example, $\mathcal{M} = \mathbb{Q}^n$ and for all
$m \in \mathcal{M}$ set $\mathrm{pen}(m) = 0$ and $S_m$ the linear span of $m$. Since
$\inf_{m \in \mathcal{M}} \|Y - \Pi_m Y\|^2 = 0$ and $Y \notin \bigcup_{m \in \mathcal{M}} S_m$ a.s., $\hat m$ does not exist with probability 1. In
the case where $\hat m$ does exist with probability 1, the quantity $\Sigma$ appearing in the
right-hand side of (4.2) can either be calculated numerically or bounded by
using Lemma 6 below.

Let us now turn to an analysis of inequality (4.2). Note that the right-hand
side of (4.2) consists of the sum of two terms,

$\frac{K}{K - 1} \inf_{m \in \mathcal{M}} \left[\frac{\|\mu - \mu_m\|^2}{\sigma^2}\left(1 + \frac{\mathrm{pen}(m)}{N_m}\right) + \mathrm{pen}(m) - D_m\right]$

and $\Sigma = \Sigma(\mathrm{pen})$, which vary in opposite directions with the size of pen.
There is clearly no hope in optimizing this sum with respect to pen without
any prior information on $\mu$. Since only $\Sigma$ depends on known quantities,
we suggest choosing the penalty in view of controlling its size. As already
seen, the choice $\mathrm{pen}(m) = K^2\phi^{-1}(a)D_m$ for some $K > 1$ allows us to obtain
a control of $\Sigma$ which is independent of $n$. This choice has the following
drawbacks. First, the penalty is the same for all the models of a given
dimension, although one could wish to associate a smaller penalty to some
of these because they possess a simpler structure. Second, it turns out that
in practice these penalties are a bit too large and lead to underfitting
by favoring small dimensional models too much. In order to
avoid these drawbacks, we suggest using the penalty structures introduced
in the next section.

4.1. Introducing new penalty functions. We associate to the collection of
models $\mathcal{S}$ a collection $\mathcal{L} = \{L_m, m \in \mathcal{M}\}$ of nonnegative numbers (weights)
such that

(4.3)   $\Sigma' = \sum_{m \in \mathcal{M}} (D_m + 1)e^{-L_m} < +\infty.$

When $\Sigma' = 1$, the choice of the sequence $\mathcal{L}$ can be interpreted as a choice
of a prior distribution $\pi$ on the set $\mathcal{M}$. This a priori choice of a collection
of $L_m$'s gives a Bayesian flavor to the selection rule. We shall see in the
next section how the sequence $\mathcal{L}$ can be chosen in practice according to the
collection at hand.

Definition 3. For $0 < q \le 1$ we define $\mathrm{EDkhi}[D, N, q]$ as the unique
solution of the equation $\mathrm{Dkhi}[D, N, \mathrm{EDkhi}[D, N, q]] = q$.

Given some $K > 1$, let us define the penalty function $\mathrm{pen}_{K,\mathcal{L}}$

(4.4)   $\mathrm{pen}_{K,\mathcal{L}}(m) = K\frac{N_m}{N_m - 1}\,\mathrm{EDkhi}[D_m + 1, N_m - 1, e^{-L_m}] \quad \forall m \in \mathcal{M}.$
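
The penalty (4.4) can be approximated numerically by inverting Dkhi. The following is a minimal sketch, assuming $\mathrm{Dkhi}[D, N, x] = \mathbb{E}[(X_D - xX_N/N)_+]/D$ and the reading $\mathrm{pen}_{K,\mathcal{L}}(m) = K\frac{N_m}{N_m - 1}\mathrm{EDkhi}[D_m + 1, N_m - 1, e^{-L_m}]$; it uses Monte Carlo with fixed draws and bisection, which is an illustrative choice and not necessarily the computation used in Section 6.1. All function names and default values are hypothetical.

```python
import math
import random

def make_dkhi(D, N, n_draws=20000, seed=0):
    """Return x -> Monte Carlo estimate of Dkhi[D, N, x] on fixed draws."""
    rng = random.Random(seed)
    draws = [
        (rng.gammavariate(D / 2.0, 2.0), rng.gammavariate(N / 2.0, 2.0))
        for _ in range(n_draws)
    ]

    def dkhi(x):
        total = sum(max(x_d - x * x_n / N, 0.0) for x_d, x_n in draws)
        return total / (n_draws * D)

    return dkhi

def edkhi(D, N, q, n_iter=40):
    """Approximate solution x of Dkhi[D, N, x] = q, by bisection."""
    dkhi = make_dkhi(D, N)
    lo, hi = 0.0, 1.0
    while dkhi(hi) > q:         # dkhi is nonincreasing: grow the bracket
        hi *= 2.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if dkhi(mid) > q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def pen_KL(D_m, n, L_m, K=1.1):
    """Approximate pen_{K,L}(m) for a model of dimension D_m and weight L_m."""
    N_m = n - D_m
    return K * N_m / (N_m - 1.0) * edkhi(D_m + 1, N_m - 1, math.exp(-L_m))

pen_small_weight = pen_KL(2, 30, 1.0)
pen_large_weight = pen_KL(2, 30, 3.0)
```

A larger weight $L_m$ gives a larger penalty, since EDkhi is decreasing in its last argument. For large $L_m$ the target level $e^{-L_m}$ lies far in the tail, where a plain Monte Carlo estimate with this number of draws becomes unreliable.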
Proposition 3. If $\mathrm{pen} = \mathrm{pen}_{K,\mathcal{L}}$ for some sequence of weights $\mathcal{L}$
satisfying (4.3), then there exists an index $\hat m$ among $\mathcal{M}$ which minimizes (1.4) with
probability 1. Besides, the estimator $\hat\mu_{\hat m}$ satisfies (4.2) with $\Sigma \le 2K^2\Sigma'/(K - 1)$.

As we shall see in Section 6.1, the penalty $\mathrm{pen}_{K,\mathcal{L}}$ or at least an upper
bound can easily be computed in practice. From a more theoretical point of
view, an upper bound for $\mathrm{pen}_{K,\mathcal{L}}(m)$ is given in the following proposition,
the proof of which is postponed to Section 10.2.



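Proposition 4. Let $m \in \mathcal{M}$ be such that $N_m \ge 7$ and $D_m \ge 1$. We set
$D = D_m + 1$, $N = N_m - 1$ and

$\Delta = \frac{L_m + \log 5 + 1/N}{1 - 5/N}.$

Then, we have the following upper bound on the penalty $\mathrm{pen}_{K,\mathcal{L}}(m)$:

(4.5)   $\mathrm{pen}_{K,\mathcal{L}}(m) \le \frac{K(N + 1)}{N}\left[1 + e^{2\Delta/(N + 2)}\sqrt{\left(1 + \frac{2D}{N + 2}\right)\frac{2\Delta}{D}}\right]^2 D.$

When $D_m = 0$ and $N_m \ge 4$, we have the upper bound

(4.6)   $\mathrm{pen}_{K,\mathcal{L}}(m) \le \frac{3K(N + 1)}{N}\left[1 + e^{2L_m/N}\sqrt{\left(1 + \frac{6}{N}\right)\frac{2L_m}{3}}\right]^2.$

In particular, if $L_m \vee D_m \le \kappa n$ for some $\kappa < 1$, then there exists a constant
$C$ depending on $\kappa$ and $K$ only, such that

$\mathrm{pen}_{K,\mathcal{L}}(m) \le C(L_m \vee D_m)$

for any $m \in \mathcal{M}$.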



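We derive from Proposition 4 and Theorem 2 (with $c = 0$) the following
risk bound for the estimator $\hat\mu_{\hat m}$.

Corollary 1. Let $\kappa < 1$. If for all $m \in \mathcal{M}$, $N_m \ge 7$ and $L_m \vee D_m \le \kappa n$,
then $\hat\mu_{\hat m}$ satisfies

(4.7)   $\mathbb{E}\left[\frac{\|\mu - \hat\mu_{\hat m}\|^2}{\sigma^2}\right] \le C\left[\inf_{m \in \mathcal{M}}\left(\frac{\|\mu - \mu_m\|^2}{\sigma^2} + D_m \vee L_m\right) + \Sigma'\right],$

where $C$ is a positive quantity depending on $\kappa$ and $K$ only.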
Note that (4.7) turns out to be an oracle-type inequality as soon as one
can choose $L_m$ of the order of $D_m$ for all $m$. Unfortunately, this is not
always possible if one wants to keep the size of $\Sigma'$ under control. Finally,
let us mention that the structure of our penalties, $\mathrm{pen}_{K,\mathcal{L}}$, is flexible enough
to recover any penalty function pen by choosing the family of weights $\mathcal{L}$
adequately. Namely, it suffices to take

$L_m = -\log \mathrm{Dkhi}\left[D_m + 1, N_m - 1, \frac{(N_m - 1)\,\mathrm{pen}(m)}{KN_m}\right]$

to obtain $\mathrm{pen}_{K,\mathcal{L}} = \mathrm{pen}$. Nevertheless, this choice of $\mathcal{L}$ does not ensure
that (4.3) holds true unless $\mathcal{M}$ is finite.

5. How to choose the weights. 

5.1. One simple way. One can proceed as follows. If the complexity index
of the collection is given by the pair $(M, a)$, then the choice

(5.1)   $L_m = a'D_m \quad \forall m \in \mathcal{M}$

for some $a' > a$ leads to the following control of $\Sigma'$:

$\Sigma' \le M \sum_{D \ge 1} D e^{-(a' - a)(D - 1)} = M\left(1 - e^{-(a' - a)}\right)^{-2}.$

In practice, this choice of $\mathcal L$ is often too rough. One of its unattractive features is that the resulting penalty penalizes all the models of a given dimension equally. Since it is not possible to give a universal recipe for choosing the family $\mathcal L$, in the sequel we consider the examples presented in Section 2 and in each case motivate a choice of a specific family $\mathcal L$ by theoretical or practical considerations.
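The closed form in the control of $\Sigma'$ under (5.1) is a differentiated geometric series. The short derivation below is only a sketch: it assumes, as in the definitions given earlier in the paper and not restated in this section, that $\Sigma' = \sum_{m}(D_m+1)e^{-L_m}$ and that a collection with complexity index $(M,a)$ contains at most $Me^{aD}$ models of dimension $D$.

```latex
\Sigma' = \sum_{m\in\mathcal M}(D_m+1)e^{-a'D_m}
  \le M\sum_{D\ge 0}(D+1)e^{-(a'-a)D}
  = M\sum_{D\ge 1} D\,x^{D-1}
  = M\,\frac{d}{dx}\Bigl(\sum_{D\ge 0}x^{D}\Bigr)
  = \frac{M}{(1-x)^{2}},
\qquad x = e^{-(a'-a)}\in(0,1).
```

The bound is finite for every $a' > a$ but blows up as $a' \downarrow a$, which is one way to see why the weights must grow strictly faster than the complexity of the collection.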



5.2. Detecting nonzero mean components. For any $D \in \{0,\dots,p\}$ and $m \in \mathcal M$ such that $|m| = D$, we set

$$L_m = L(D) = \log\binom{n}{D} + 2\log(D+1)$$

and $\mathrm{pen}(m) = \mathrm{pen}_{K,\mathcal L}(m)$ where $K$ is some fixed constant larger than 1. Since $\mathrm{pen}(m)$ only depends on $|m|$, we write

$$\mathrm{pen}(m) = \mathrm{pen}(|m|). \tag{5.2}$$

From a practical point of view, $\hat m$ can be computed as follows. Let $Y^2_{(n)},\dots,Y^2_{(1)}$ be random variables obtained by ordering $Y_1^2,\dots,Y_n^2$ in the following way:

$$Y^2_{(n)} < Y^2_{(n-1)} < \cdots < Y^2_{(1)} \quad \text{a.s.}$$

and $\hat D$ the integer minimizing over $D \in \{0,\dots,p\}$ the quantity

$$\sum_{i=D+1}^{n} Y^2_{(i)}\Bigl(1 + \frac{\mathrm{pen}(D)}{n-D}\Bigr). \tag{5.3}$$

Then the subset $\hat m$ coincides with $\{(1),\dots,(\hat D)\}$ if $\hat D \ge 1$ and $\emptyset$ otherwise. In Section 6 a simulation study evaluates the performance of this method for several values of $K$.
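The computation of $\hat m$ described above (sort the squared observations, then scan $D = 0,\dots,p$) can be sketched in a few lines. This is an illustration only: the function name `select_nonzero_components` and the `pen` argument are hypothetical, and the penalty used in the usage example is a placeholder, not the penalty $\mathrm{pen}_{K,\mathcal L}$ of the paper.

```python
def select_nonzero_components(y, p, pen):
    """Return (D_hat, m_hat) minimizing criterion (5.3) over D in {0, ..., p}.

    y   : list of observations Y_1, ..., Y_n
    p   : largest number of nonzero components considered (p < n)
    pen : callable D -> pen(D) >= 0 (hypothetical argument, any penalty can be plugged in)
    """
    n = len(y)
    # Indices ordered so that Y_(1)^2 >= Y_(2)^2 >= ... >= Y_(n)^2.
    order = sorted(range(n), key=lambda i: y[i] ** 2, reverse=True)
    sq = [y[i] ** 2 for i in order]
    # tail[D] = sum of Y_(i)^2 for i > D: residual sum of squares when D components are kept.
    tail = [0.0] * (n + 1)
    for d in range(n - 1, -1, -1):
        tail[d] = tail[d + 1] + sq[d]
    best_d, best_crit = 0, None
    for d in range(p + 1):
        crit = tail[d] * (1.0 + pen(d) / (n - d))
        if best_crit is None or crit < best_crit:
            best_d, best_crit = d, crit
    # m_hat is returned as a sorted list of 0-based indices (empty when D_hat = 0).
    return best_d, sorted(order[:best_d])


# Usage with a placeholder penalty pen(D) = 2 D (an assumption made for the example).
y = [10.0] + [0.1] * 9
d_hat, m_hat = select_nonzero_components(y, p=3, pen=lambda d: 2.0 * d)
print(d_hat, m_hat)  # 1 [0]
```

Sorting dominates the cost, so the whole search runs in $O(n\log n)$ time once the values $\mathrm{pen}(D)$ are available.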

From a theoretical point of view, our choice of $L_m$'s implies the following bound on $\Sigma'$:

$$\Sigma' = \sum_{D=0}^{p}\binom{n}{D}(D+1)\,e^{-L(D)} = \sum_{D=1}^{p+1}\frac{1}{D} \le 1 + \log(p+1) \le 1 + \log(n).$$
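A quick numerical check of this bound is given below as a sketch. It assumes $\Sigma' = \sum_m (D_m+1)e^{-L_m}$ with $\binom{n}{D}$ models of dimension $D$, and works in the log domain (via `math.lgamma`) so that large binomial coefficients do not overflow; the helper names are hypothetical.

```python
import math


def log_binom(n, d):
    """log of the binomial coefficient C(n, d), computed with log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(d + 1) - math.lgamma(n - d + 1)


def weight(n, d):
    """L(D) = log C(n, D) + 2 log(D + 1), the weight of Section 5.2."""
    return log_binom(n, d) + 2.0 * math.log(d + 1)


def sigma_prime(n, p):
    """Sum over D = 0..p of C(n, D) (D + 1) exp(-L(D)), term by term in the log domain."""
    return sum(
        math.exp(log_binom(n, d) + math.log(d + 1) - weight(n, d))
        for d in range(p + 1)
    )


n, p = 512, 82
print(sigma_prime(n, p), 1 + math.log(p + 1), 1 + math.log(n))
```

Each term reduces to $1/(D+1)$, so the sum is a harmonic number and grows only logarithmically in $p$.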



As to the penalty, let us fix some $m$ in $\mathcal M$ with $|m| = D$. The usual bound $\log\binom{n}{D} \le D\log(n)$ implies $L_m \le D(2+\log n) \le p(2+\log n)$ and consequently, under the assumption

$$p \le \frac{\kappa n}{2+\log n} \wedge (n-7)$$

for some $\kappa < 1$, we deduce from Corollary 1 that for some constant $C' = C'(\kappa,K)$, the estimator $\hat\mu_{\hat m}$ satisfies

$$\mathbb E[\|\mu-\hat\mu_{\hat m}\|^2] \le C'\inf_{m\in\mathcal M}\bigl[\|\mu-\mu_m\|^2 + (D_m+1)\log(n)\sigma^2\bigr] \le C'(1+|m^*|)\log(n)\sigma^2.$$

As already mentioned, we know that the $\log(n)$ factor in the risk bound is unavoidable. Unlike the former choice of $\mathcal L$ suggested by (5.1) [with $a' = \log(n)+1$, e.g.], the bound for $\Sigma'$ we get here is not independent of $n$ but rather grows with $n$ at rate $\log(n)$. Compared with the former, this latter weighting strategy leads to similar risk bounds and to a better performance of the estimator in practice.



5.3. Variable selection. We propose to handle simultaneously complete and ordered variable selection. First, we consider the $p$ explanatory variables that we believe to be the most important among the set of the $N$ possible ones. Then, we index these from 1 to $p$ by decreasing order of importance and index the $N-p$ remaining ones arbitrarily. We do not assume that our guess on the importance of the various variables is right. We define $\mathcal M_o$ and $\mathcal M$ according to Section 2.2 and for some $c > 0$ set $L_m = c|m|$ if $m \in \mathcal M_o$, and otherwise set

$$L_m = L(|m|) \quad\text{where}\quad L(D) = \log\binom{N}{D} + \log p + \log(D+1).$$

For $K > 1$, we select the subset $\hat m$ as the minimizer among $\mathcal M$ of the criterion $m \mapsto \mathrm{Crit}_L(m)$ given by (1.4) with $\mathrm{pen}(m) = \mathrm{pen}_{K,\mathcal L}(m)$. Except in the favorable situation where the vectors $x^{(j)}$ are orthogonal in $\mathbb R^n$ there seems, unfortunately, to be no way of computing $\hat m$ in polynomial time. Nevertheless, the method can be applied for reasonable values of $N$ and $p$ as shown in Section 6.3. From a theoretical point of view, our choice of $L_m$'s leads to the following bound on the residual term $\Sigma'$:

$$\begin{aligned}
\Sigma' &\le \sum_{m\in\mathcal M_o}(|m|+1)e^{-L_m} + \sum_{m\in\mathcal M\setminus\mathcal M_o}(|m|+1)e^{-L_m}\\
&\le \sum_{D=0}^{p}(D+1)e^{-cD} + \sum_{D=1}^{p}\binom{N}{D}(D+1)e^{-L(D)}\\
&\le 1 + (1-e^{-c})^{-2}.
\end{aligned}$$



Besides, we deduce from Corollary 1 that if $p$ satisfies

$$p \le \frac{\kappa n}{c} \wedge \frac{\kappa n}{2+\log N} \wedge (n-7) \quad\text{with } \kappa < 1,$$

then

$$\mathbb E[\|\mu-\hat\mu_{\hat m}\|^2] \le C(\kappa,K,c)\,(B_o \wedge B_c), \tag{5.4}$$

where

$$B_o = \inf_{m\in\mathcal M_o}\bigl(\|\mu-\mu_m\|^2 + (|m|+1)\sigma^2\bigr),\qquad
B_c = \inf_{m\in\mathcal M}\bigl[\|\mu-\mu_m\|^2 + (|m|+1)\log(eN)\sigma^2\bigr].$$

It is interesting to compare the risk bound (5.4) with the one we can get by using the former choice of weights $\mathcal L$ given in (5.1) [with $a' = \log(N)+1$], that is

$$\mathbb E[\|\mu-\hat\mu_{\hat m}\|^2] \le C'(\kappa,K)\,B_c. \tag{5.5}$$

Up to constants, we see that (5.4) improves (5.5) by a $\log(N)$ factor whenever the minimizer $m^*$ of $\mathbb E[\|\mu-\hat\mu_m\|^2]$ among $\mathcal M$ does belong to $\mathcal M_o$.



5.4. Multiple change-points detection. In this section, we consider the problems of change-points detection presented in Section 2.3.

5.4.1. Detecting and estimating the jumps of f. We consider here the collection of models described in Section 2.3.1 and associate to each $m$ the weight $L_m$ given by

$$L_m = L(|m|) = \log\binom{n-1}{|m|} + 2\log(|m|+2),$$

where $K$ is some number larger than 1. This choice gives the following control on $\Sigma'$:

$$\Sigma' = \sum_{D=0}^{p}\binom{n-1}{D}(D+2)\,e^{-L(D)} = \sum_{D=0}^{p}\frac{1}{D+2} \le \log(p+2).$$

Let $D$ be some arbitrary positive integer not larger than $p$. If $f$ belongs to the class of functions which are piecewise constant on an arbitrary partition of $[0,1)$ into $D$ intervals, then $\mu = (f(x_1),\dots,f(x_n))'$ belongs to some $S_m$ with $m \in \mathcal M$ and $|m| \le D-1$. We deduce from Corollary 1 that if $p$ satisfies

$$p \le \frac{\kappa n - 2}{2+\log n} \wedge (n-8)$$

for some $\kappa < 1$, then

$$\mathbb E[\|\mu-\hat\mu_{\hat m}\|^2] \le C(\kappa,K)\,D\log(n)\,\sigma^2.$$

5.4.2. Detecting and estimating the jumps of f'. In this section, we deal with the collection of models of Section 2.3.2. Note that this collection is not finite. We use the following weighting strategy. For any pair of integers $j, q$ such that $q \le 2^j - 1$, we set

$$L(j,q) = \log\binom{2^j-1}{q} + q + 2\log j.$$

Since an element $m \in \mathcal M$ may belong to different $\mathcal M_{j,q}$, we set $L_m = \inf\{L(j,q),\ m \in \mathcal M_{j,q}\}$. This leads to the following control of $\Sigma'$:

$$\Sigma' \le \sum_{j\ge1}\sum_{q=0}^{2^j-1}|\mathcal M_{j,q}|\,(q+3)\,e^{-L(j,q)}
\le \sum_{j\ge1}\frac{1}{j^2}\sum_{q\ge0}(q+3)e^{-q}
= \frac{\pi^2 e\,(3e-2)}{6(e-1)^2}.$$
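The value of the constant can be checked by hand. The derivation below is a sketch that assumes $|\mathcal M_{j,q}| \le \binom{2^j-1}{q}$ (the collection is defined in Section 2.3.2, which is not reproduced here), so that each term of the double sum is at most $(q+3)e^{-q}/j^2$.

```latex
\sum_{q\ge 0}(q+3)x^{q}
  = \frac{x}{(1-x)^{2}} + \frac{3}{1-x}
  = \frac{3-2x}{(1-x)^{2}}
  \;\overset{x=e^{-1}}{=}\; \frac{e\,(3e-2)}{(e-1)^{2}},
\qquad
\sum_{j\ge 1}\frac{1}{j^{2}} = \frac{\pi^{2}}{6}.
```

Multiplying the two sums gives the stated bound, which is a numerical constant independent of $n$.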

For some positive integer $q$ and $R > 0$, we define $S^1(q,R)$ as the set of continuous functions $f$ on $[0,1)$ of the form

$$f(x) = \sum_{i=1}^{q+1}(\alpha_i x + \beta_i)\,\mathbf 1_{[a_{i-1},a_i)}(x)$$

with $0 = a_0 < a_1 < \cdots < a_{q+1} = 1$, $(\beta_1,\dots,\beta_{q+1})' \in \mathbb R^{q+1}$ and $(\alpha_1,\dots,\alpha_{q+1})' \in \mathbb R^{q+1}$, such that

$$\frac1q\sum_{i=1}^{q}|\alpha_{i+1}-\alpha_i| \le R.$$

The following result holds.

Corollary 2. Assume that $n \ge 9$. Let $K > 1$, $\kappa \in\, ]0,1[$, $\kappa' > 0$ and $p$ such that

$$p \le (\kappa n - 2) \wedge (n-9). \tag{5.6}$$

Let $f \in S^1(q,R)$ with $q \in \{1,\dots,p\}$ and $R \le \sigma e^{\kappa' n/q}$. If $\mu$ is defined by (2.1) then there exists a constant $C$ depending on $K$ and $\kappa, \kappa'$ only such that

$$\mathbb E[\|\mu-\hat\mu_{\hat m}\|^2] \le C q \sigma^2\Bigl[1 + \log\Bigl(1 \vee \frac{nR^2}{q\sigma^2}\Bigr)\Bigr].$$

We postpone the proof of this result to Section 10.3.

5.5. Estimating a signal. We deal with the collection introduced in Section 2.4 and to each $m = (r,k_1,\dots,k_d) \in \mathcal M$, associate the weight $L_m = (r+1)^d k_1\cdots k_d$. With such a choice of weights, one can show that $\Sigma' \le (e/(e-1))^{2(d+1)}$. For $\boldsymbol\alpha = (\alpha_1,\dots,\alpha_d)$ and $\mathbf R = (R_1,\dots,R_d)$ in $]0,+\infty[^d$, we denote by $\mathcal H(\boldsymbol\alpha,\mathbf R)$ the space of $(\boldsymbol\alpha,\mathbf R)$-Hölderian functions on $[0,1)^d$, which is the set of functions $f:[0,1)^d \to \mathbb R$ such that for any $i = 1,\dots,d$ and $t_1,\dots,t_d,z_i \in [0,1)$

$$\Bigl|\frac{\partial^{r_i}}{\partial t_i^{r_i}}f(t_1,\dots,t_i,\dots,t_d) - \frac{\partial^{r_i}}{\partial t_i^{r_i}}f(t_1,\dots,z_i,\dots,t_d)\Bigr| \le R_i\,|t_i-z_i|^{\beta_i},$$

where $r_i + \beta_i = \alpha_i$, with $r_i \in \mathbb N$ and $0 < \beta_i \le 1$.

In the sequel, we set $\|x\|_n^2 = \|x\|^2/n$ for $x \in \mathbb R^n$. By applying our procedure with the above weights and some $K > 1$, we obtain the following result.



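Corollary 3. Assume $n \ge 14$. Let $\boldsymbol\alpha$ and $\mathbf R$ fulfill the two conditions

$$n^{\alpha}R_i^{2\alpha+d} \ge \bar R^{d}\sigma^{2\alpha} \quad\text{and}\quad n^{\alpha}R_i^{d} \ge 2^{\alpha}\bar R^{d}(r+1)^{d\alpha}, \qquad\text{for } i = 1,\dots,d,$$

where

$$r = \sup_{i=1,\dots,d} r_i, \qquad \alpha = \Bigl(\frac1d\sum_{i=1}^{d}\frac{1}{\alpha_i}\Bigr)^{-1} \qquad\text{and}\qquad \bar R = \bigl(R_1^{\alpha/\alpha_1}\cdots R_d^{\alpha/\alpha_d}\bigr)^{1/d}.$$

Then, there exists some constant $C$ depending on $r$ and $d$ only, such that for any $\mu$ given by (2.1) with $f \in \mathcal H(\boldsymbol\alpha,\mathbf R)$,

$$\mathbb E[\|\mu-\hat\mu_{\hat m}\|_n^2] \le C\Bigl[\Bigl(\frac{\bar R^{d/\alpha}\sigma^2}{n}\Bigr)^{2\alpha/(2\alpha+d)} \vee \frac{\bar R^2}{n^{2\alpha/d}}\Bigr].$$

The rate $n^{-2\alpha/(2\alpha+d)}$ is known to be minimax for density estimation in $\mathcal H(\boldsymbol\alpha,\mathbf R)$ [see Ibragimov and Khas'minskii (1981)].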

6. Simulation study. In order to evaluate the practical performance of 
our criterion, we carry out two simulation studies. In the first study, we 
consider the problem of detecting nonzero mean components. For the sake of 
comparison, we also include the performances of AIC, BIC and AMDL whose 
theoretical properties have been studied in Section 3. In the second study, 
we consider the variable selection problem and compare our procedure with 
adaptive Lasso recently proposed by Zou (2006). From a theoretical point of 
view, this last method cannot be compared with ours because its properties 
are shown assuming that the error variance is known. Nevertheless, this 
method gives good results in practice and the comparison with ours may 
be of interest. The calculations are made with R (www.r-project.org) and 
are available on request. We also mention that a simulation study has been 
carried out for the problem of multiple change-points detection (see Section 
2.3). The results are available in Baraud, Giraud and Huet (2007). 

6.1. Computation of the penalties. The calculation of the penalties we propose requires that of the EDkhi function or at least an upper bound for it. For $0 < q \le 1$, the value $\operatorname{EDkhi}(D,N,q)$ is obtained by numerically solving for $x$ the equation

$$q = \mathbb P\Bigl(F_{D+2,N} \ge \frac{x}{D+2}\Bigr) - \frac{x}{D}\,\mathbb P\Bigl(F_{D,N+2} \ge \frac{(N+2)x}{ND}\Bigr),$$

where $F_{D,N}$ denotes a Fisher random variable with $D$ and $N$ degrees of freedom (see Lemma 6). However, this value of $x$ cannot be determined accurately enough when $q$ is too small. Rather, when $q \le e^{-500}$ and $D \ge 2$, we bound the value of $\operatorname{EDkhi}(D,N,q)$ from above by solving for $x$ the equation

$$\frac{q}{2B(1+D/2,\,N/2)} = \frac{2+NDx^{-1}}{N(N+2)}\Bigl(\frac{N}{N+x}\Bigr)^{N/2}\Bigl(\frac{x}{N+x}\Bigr)^{D/2},$$

where $B(p,q)$ stands for the beta function. This upper bound follows from formula (9.6), Lemma 6.
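For readers who want to experiment without an F-distribution routine, the sketch below swaps in a different technique: a plain Monte Carlo estimate of Dkhi followed by bisection. It is not the numerical scheme used in the paper. It assumes the definition $\operatorname{Dkhi}[D,N,x] = \mathbb E[(X_D - xX_N/N)_+]/D$ with independent $X_D \sim \chi^2(D)$ and $X_N \sim \chi^2(N)$ (the form in which Dkhi enters the proof of Theorem 2), and the function names are hypothetical.

```python
import random


def chi2_sample(df, rng):
    # A chi-square variable with df degrees of freedom is Gamma(shape=df/2, scale=2).
    return rng.gammavariate(df / 2.0, 2.0)


def make_dkhi(D, N, n_draws=20000, seed=0):
    """Return a Monte Carlo estimate x -> Dkhi[D, N, x] built on one fixed sample."""
    rng = random.Random(seed)
    draws = [(chi2_sample(D, rng), chi2_sample(N, rng) / N) for _ in range(n_draws)]

    def dkhi(x):
        # Average of the positive parts, normalized by E[X_D] = D.
        return sum(max(xd - x * vn, 0.0) for xd, vn in draws) / (D * n_draws)

    return dkhi


def edkhi(dkhi, q, tol=1e-6):
    """Solve dkhi(x) = q for x >= 0 by bisection (dkhi is nonincreasing in x)."""
    if q >= dkhi(0.0):
        return 0.0
    lo, hi = 0.0, 1.0
    while dkhi(hi) > q:  # expand the bracket until the target is crossed
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if dkhi(mid) > q:
            lo = mid
        else:
            hi = mid
    return hi


dkhi = make_dkhi(D=3, N=10, n_draws=5000, seed=1)
x = edkhi(dkhi, 0.1)
print(x, dkhi(x))
```

Reusing the same sample for every $x$ keeps the estimated function monotone, so bisection is well defined. The estimate is only meaningful for moderate $q$: for very small $q$ the Monte Carlo tail is empty.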

6.2. Detecting nonzero mean components. 

Description of the procedure. We implement the procedure as described in Sections 2.1 and 5.2. More precisely, we select the set $\{(1),\dots,(\hat D)\}$ where $\hat D$ minimizes among $D$ in $\{0,\dots,p\}$ the quantity defined at equation (5.3). In the case of our procedure, the penalty function pen depends on a parameter $K$, and is equal to

$$\mathrm{pen}_K(D) = K\,\frac{n-D}{n-D-1}\,\operatorname{EDkhi}\Bigl[D+1,\ n-D-1,\ \Bigl((D+1)^2\binom{n}{D}\Bigr)^{-1}\Bigr].$$

We consider the three values $\{1;\ 1.1;\ 1.2\}$ for the parameter $K$ and denote $\hat D$ by $\hat D_K$, thus emphasizing the dependency on $K$. Even though the theory does not cover the case $K = 1$, it is worth studying the behavior of the procedure for this critical value. For the AIC, BIC and AMDL criteria, the penalty functions are respectively equal to

$$\mathrm{pen}_{\mathrm{AIC}}(D) = (n-D)\Bigl[\exp\Bigl(\frac{2D}{n}\Bigr) - 1\Bigr],$$

$$\mathrm{pen}_{\mathrm{BIC}}(D) = (n-D)\Bigl[\exp\Bigl(\frac{D\log(n)}{n}\Bigr) - 1\Bigr],$$

$$\mathrm{pen}_{\mathrm{AMDL}}(D) = (n-D)\Bigl[\exp\Bigl(\frac{3D\log(n)}{n}\Bigr) - 1\Bigr].$$

We denote by $\hat D_{\mathrm{AIC}}$, $\hat D_{\mathrm{BIC}}$ and $\hat D_{\mathrm{AMDL}}$ the corresponding values of $\hat D$.
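The three classical penalties are simple enough to write down directly. The following sketch assumes the forms $(n-D)[\exp(c_n D/n) - 1]$ with $c_n = 2$, $\log n$ and $3\log n$ for AIC, BIC and AMDL respectively; the function names are hypothetical.

```python
import math


def pen_aic(D, n):
    """AIC written as a penalty on the residual sum of squares."""
    return (n - D) * (math.exp(2.0 * D / n) - 1.0)


def pen_bic(D, n):
    """BIC written as a penalty on the residual sum of squares."""
    return (n - D) * (math.exp(D * math.log(n) / n) - 1.0)


def pen_amdl(D, n):
    """AMDL written as a penalty on the residual sum of squares."""
    return (n - D) * (math.exp(3.0 * D * math.log(n) / n) - 1.0)


n = 32
for D in (1, 5, 9):
    print(D, pen_aic(D, n), pen_bic(D, n), pen_amdl(D, n))
```

As soon as $\log n > 2$ the three penalties are ordered AIC < BIC < AMDL for every $D \ge 1$, and the AMDL penalty grows much faster with $D$ than the other two.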

Simulation scheme. For $\theta = (n,p,k,s) \in \mathbb N \times \{(p,k) \in \mathbb N^2 \mid k \le p\} \times \mathbb R$, we denote by $P_\theta$ the distribution of a Gaussian vector $Y$ in $\mathbb R^n$ whose components are independent with common variance 1 and mean $\mu_i = s$ if $i \le k$ and $\mu_i = 0$ otherwise. Neither $s$ nor $k$ are known but we shall assume the upper bound $p$ on $k$ known:

$$\Theta = \bigl\{(2^j, p, k, s),\ j \in \{5, 9, 11, 13\},\ p = \lfloor n/\log(n)\rfloor,\ k \in I_p,\ s \in \{3,4,5\}\bigr\},$$

where

$$I_p = \{2^{j'},\ j' = 0,\dots,\lfloor\log_2(p)\rfloor\} \cup \{0, p\}.$$
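The parameter set $\Theta$ can be enumerated explicitly, which also makes the values of $p$ used in the tables easy to verify. This is a sketch with hypothetical function names; it takes $\log$ to be the natural logarithm, an assumption that reproduces the values $p = 9$ and $p = 82$ shown in Figure 1.

```python
import math


def k_values(p):
    """I_p = {2^j : j = 0, ..., floor(log2 p)} union {0, p}, as a sorted list."""
    max_exponent = p.bit_length() - 1  # floor(log2(p)) for a positive integer p
    powers = {2 ** j for j in range(max_exponent + 1)}
    return sorted(powers | {0, p})


def parameter_grid():
    """Enumerate all (n, p, k, s) in Theta."""
    grid = []
    for j in (5, 9, 11, 13):
        n = 2 ** j
        p = int(math.floor(n / math.log(n)))
        for k in k_values(p):
            for s in (3, 4, 5):
                grid.append((n, p, k, s))
    return grid


grid = parameter_grid()
print(sorted({(n, p) for n, p, _, _ in grid}))
# [(32, 9), (512, 82), (2048, 268), (8192, 909)]
```

The grid contains both $k = 0$ (pure noise) and $k = p$ (the largest admissible model), the two extreme cases discussed in the results.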



For each $\theta \in \Theta$, we evaluate the performance of each criterion as follows. On the basis of the 1000 simulations of $Y$ of law $P_\theta$ we estimate the risk $R(\theta) = \mathbb E_\theta[\|\mu-\hat\mu_{\hat m}\|^2]$. Then, if $k$ is positive, we calculate the risk ratio $r(\theta) = R(\theta)/O(\theta)$, where $O(\theta)$ is the infimum of the risks over all $m \in \mathcal M$. More precisely,

$$O(\theta) = \inf_{m\in\mathcal M}\mathbb E_\theta[\|\mu-\hat\mu_m\|^2] = \inf_{D=0,\dots,p}\bigl[s^2(k-D)\mathbf 1_{D\le k} + D\bigr].$$

It turns out that, in our simulation study, $O(\theta) = k$ for all $n$ and $s$.
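The oracle risk is a one-line minimization, and evaluating it shows why it equals $k$ here: since $s^2 \ge 9 > 1$, dropping a true component costs more than the unit variance paid for keeping it. A minimal sketch (the function name is hypothetical):

```python
def oracle_risk(k, s, p):
    """inf over D = 0..p of s^2 (k - D) 1{D <= k} + D, for unit noise variance."""
    return min(s * s * (k - D) * (D <= k) + D for D in range(p + 1))


print([oracle_risk(k, 3, 9) for k in (0, 1, 2, 4, 8, 9)])  # [0, 1, 2, 4, 8, 9]
print(oracle_risk(4, 0.5, 9))  # 1.0
```

The second call uses a weak signal ($s = 0.5$, a value outside the simulation design) for which the oracle prefers the null model, so $O(\theta) = k$ is a feature of the chosen signal strengths rather than a general identity.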



Results. When $k = 0$, that is when the mean of $Y$ is 0, the results for AIC, BIC and AMDL criteria are given in Table 1. The theoretical results given in Sections 3.2 and 3.3.2 are confirmed by the simulation study: when the complexity of the model collection $a$ equals $\log(n)$, AMDL satisfies the assumption of Theorem 1 and therefore the risk remains bounded, while the AIC and BIC criteria lead to over-fitting (see Proposition 2). In nearly all simulated samples (99% for $n = 32$, 100% for larger $n$), the BIC criterion selects a positive $\hat D$, and the AIC criterion chooses $\hat D$ equal to the largest possible dimension $p$. Our procedure, whose results are given in Table 2, performs similarly to AMDL. Since larger penalties favor low-dimensional models, our procedure performs better as $K$ increases. AMDL overpenalizes models with positive dimension all the more as $n$ grows, and its performance improves accordingly.

Table 1
Case k = 0. AIC, BIC and AMDL criteria: estimated risk R and percentage of the number of simulations for which $\hat D$ is positive

            AIC                        BIC                        AMDL
   n        R      D_AIC > 0           R      D_BIC > 0           R      D_AMDL > 0
   32       24     100%                23     99%                 0.65   6.2%
   512      296    100%                79     100%                0.05   0.3%
   2048     1055   100%                139    100%                0.02   0.1%
   8192     3830   100%                276    100%                0.09   0.3%

When $k$ is positive, Table 3 gives, for each $n$, the maximum of the risk ratios over $k$ and $s$. Note that the largest values of the risk ratios are achieved for the AMDL criterion. Besides, the AMDL risk ratio is maximum for large values of $k$. This is due to the fact that the quantity $3\log(n)$ involved in the AMDL penalty tends to penalize too severely models with large dimensions. Even in the favorable situation where the signal to noise ratio is large, the AMDL criterion is unable to estimate $k$ when $k$ and $n$ are both too large. For example, Table 4 presents the values of the risk ratios when $k = n/16$ and $s = 5$, for several values of $n$. Except in the situation where $n = 32$ and $k = 2$, the mean of the selected $\hat D_{\mathrm{AMDL}}$'s is small although the true $k$ is large. This overpenalization phenomenon is illustrated by Figure 1 which compares the AMDL penalty function with ours for $K = 1.1$. Let us now turn to the case where $k$ is small. The results for $k = 1$ are presented in Table 5. When $n = 32$, the methods are approximately equivalent whatever the value of $K$.

Finally, let us discuss the choice of $K$. When $k$ is large, the risk ratios do not vary with $K$ (see Table 4). Nevertheless, as illustrated by Table 5, $K$ must stay close to 1 in order to avoid overpenalization. We suggest taking $K = 1.1$.

Table 2
Case k = 0. Estimated risk R and percentage of the number of simulations for which $\hat D$ is positive using our penalty $\mathrm{pen}_K$

            K = 1                  K = 1.1                K = 1.2
   n        R      D_K > 0         R      D_K > 0         R      D_K > 0
   32       0.67   6.4%            0.40   3.7%            0.25   2.2%
   512      0.98   5.7%            0.33   1.9%            0.07   0.4%
   2048     1.00   5.1%            0.48   2.3%            0.09   0.4%
   8192     0.96   4.2%            0.31   1.2%            0.14   0.5%

Table 3
For each n, maximum of the estimated risk ratios $r_{\max}$ over the values of (k, s) for k > 0. $\bar k$ and $\bar s$ are the values of k and s where the maxima are reached

            Our criterion with
            K = 1               K = 1.1             K = 1.2             AMDL
   n        r_max  k     s      r_max  k     s      r_max  k     s      r_max  k     s
   32       14.6   9     4      15.2   9     4      15.4   9     4      23.2   9     5
   512      11.5   82    4      15.2   82    4      15.9   82    4      25.0   82    5
   2048     10.7   1     4      15.5   268   4      16.0   256   4      25.0   256   5
   8192     12.7   1     4      13.9   909   4      16.0   909   4      25.0   512   5



Table 4
Case k = n/16 and s = 5. Estimated risk ratio r and mean of the $\hat D$'s

                    Our criterion with
                    K = 1            K = 1.1          K = 1.2          AMDL
   n        k       r      mean D    r      mean D    r      mean D    r      mean D
   32       2       3.43   2.04      3.89   1.94      4.49   1.85      3.39   1.90
   512      32      1.96   33.2      1.93   32.6      1.94   32.1      23.5   2.12
   2048     128     1.89   131       1.89   130       1.91   128       25     0.52
   8192     512     1.91   532       1.89   523       1.89   515       25     0.22



Table 5
Case k = 1 and s = 5. For each n, estimated risk ratio followed by the percentages of simulations for which $\hat D$ is equal to 0, 1 and larger than 1

            Our criterion with
            K = 1                          K = 1.1                        K = 1.2
                   Histogram                      Histogram                      Histogram
   n        R      = 0    = 1    >= 2      R      = 0    = 1    >= 2      R      = 0    = 1    >= 2
   32       3.6    7.3    84.8   7.9       3.9    9.8    84.6   5.6       4.5    12.9   82.7   4.4
   512      5.4    14.6   80.4   5.0       6.1    20.3   77.8   1.9       7.2    26.0   73.0   1.0
   2048     7.1    21.8   74.9   3.3       8.2    28.6   70.1   1.3       9.6    35.4   64.1   0.5
   8192     9.1    29.5   67.7   2.8       10.4   37.4   61.6   1.0       12.2   45.9   53.9   0.2
[Figure: two panels, "n = 32, p = 9" and "n = 512, p = 82", each plotting the penalty against D.]

Fig. 1. Comparison of the penalty functions $\mathrm{pen}_{\mathrm{AMDL}}(D)$ and $\mathrm{pen}_K(D)$ for $K = 1.1$.

6.3. Variable selection. We present two simulation studies for illustrat- 
ing the performances of our method for variable selection and compare 
them to the adaptive Lasso. The first simulation scheme was proposed by 
Zou (2006). The second one involves highly correlated covariates. 



Description of the procedure. We consider the variable selection problem described in Section 2.2 and we implement the procedure considering the collection $\mathcal M$ for complete variable selection defined in Section 2.2.2 with maximal dimension $p$. We select the subset $\hat m$ of $\{1,\dots,N\}$ minimizing $\mathrm{Crit}_L(m)$ given at equation (1.4) with penalty function

$$\mathrm{pen}(m) = \mathrm{pen}(|m|) = K\,\frac{n-|m|}{n-|m|-1}\,\operatorname{EDkhi}\Bigl[|m|+1,\ n-|m|-1,\ \Bigl(p\,(|m|+1)\binom{N}{|m|}\Bigr)^{-1}\Bigr].$$

This choice for the penalty ensures a quasi oracle bound for the risk of $\hat\mu_{\hat m}$ [see inequality (5.5)].



The adaptive Lasso procedure. The adaptive Lasso procedure proposed by Zou starts with a preliminary estimator $\tilde a$ of $a$ as, for example, the ordinary least squares estimator when it exists. Then one computes the minimizer $\hat a^w$ among those $a \in \mathbb R^N$ of the criterion

$$\mathrm{Crit}_{\mathrm{Lasso}}(a) = \Bigl\|Y - \sum_{j=1}^{N}a_j x^{(j)}\Bigr\|^2 + \lambda\sum_{j=1}^{N}\hat w_j|a_j|,$$

where the weights $\hat w_j = 1/|\tilde a_j|^{\gamma}$ for $j = 1,\dots,N$. The smoothing parameters $\lambda$ and $\gamma$ are chosen by cross-validation. The set $\hat m_{\mathrm{Lasso}}$ is the set of indices $j$ such that $\hat a^w_j$ is nonzero.

Simulation scheme. Let $M(n,N)$ be the set of matrices with $n$ rows and $N$ columns. For $\theta = (X,a,\sigma) \in M(n,N) \times \mathbb R^N \times \mathbb R_+$, we denote by $P_\theta$ the distribution of a Gaussian vector $Y$ in $\mathbb R^n$ with mean $\mu = Xa$ and covariance $\sigma^2 I_n$. We consider two choices for the pair $(X,a)$. The first one is based on the Model 1 considered by Zou (2006) in his simulation study. More precisely, $N = 8$ and the rows of the matrix $X$ are $n$ i.i.d. Gaussian centered variables such that for all $1 \le j < k \le 8$ the correlation between $x^{(j)}$ and $x^{(k)}$ equals $0.5^{k-j}$. We did $S = 50$ simulations of the matrix $X$, denoted $\mathcal X_S = \{X_s,\ s = 1,\dots,S\}$, and define

$$\Theta_1 = \bigl\{(X,a,\sigma),\ X \in \mathcal X_S,\ a = (3, 1.5, 0, 0, 2, 0, 0, 0)^T,\ \sigma \in \{1,3\}\bigr\}.$$

The second one is constructed as follows. Let $x^{(1)}, x^{(2)}, x^{(3)}$ be three vectors of $\mathbb R^n$ defined by

$$x^{(1)} = (1,-1,0,\dots,0)^T/\sqrt2,$$

$$x^{(2)} = (-1, 1.001, 0,\dots,0)^T/\sqrt{1+1.001^2},$$

$$x^{(3)} = (1/\sqrt2, 1/\sqrt2, 1/n,\dots,1/n)^T/\sqrt{1+(n-2)/n^2}$$

and for $4 \le j \le n$, let $x^{(j)}$ be the $j$th vector of the canonical basis of $\mathbb R^n$. We take $N = n$ and $\mu = (n,n,0,\dots,0)^T$. Let $a \in \mathbb R^N$ satisfy $\mu = Xa$. Note that only the two first components of $a$ are nonzero. We thus define $\Theta_2 = \{(X,a,1)\}$.
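The second design is easy to build explicitly, which makes its near-collinearity visible. The sketch below uses hypothetical helper names (`design_theta2`, `dot`) and plain Python lists; it only constructs the columns and prints the inner product of the first two.

```python
import math


def design_theta2(n):
    """Return the columns x^(1), ..., x^(n) of the second design matrix (n >= 4)."""
    x1 = [1 / math.sqrt(2), -1 / math.sqrt(2)] + [0.0] * (n - 2)
    c2 = math.sqrt(1 + 1.001 ** 2)
    x2 = [-1 / c2, 1.001 / c2] + [0.0] * (n - 2)
    c3 = math.sqrt(1 + (n - 2) / n ** 2)
    x3 = [1 / (math.sqrt(2) * c3)] * 2 + [1 / (n * c3)] * (n - 2)
    # x^(j) for 4 <= j <= n is the j-th canonical basis vector (0-based index j - 1).
    canonical = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(3, n)]
    return [x1, x2, x3] + canonical


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


cols = design_theta2(20)
print(dot(cols[0], cols[1]))  # inner product of x^(1) and x^(2), very close to -1
```

All columns have unit norm, and $x^{(1)}$ and $x^{(2)}$ are almost opposite, so the coefficients needed to represent $\mu = (n,n,0,\dots,0)^T$ on these two columns are of order $10^3 n$.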

We choose $n = 20$ and for each $\theta \in \Theta_1 \cup \Theta_2$ we did 500 simulations of $Y$ with law $P_\theta$.

Our procedures were carried out considering all (nonvoid) subsets $m$ of $\{1,\dots,N\}$ with cardinality not larger than $p = 8$. On the basis of the results obtained in the preceding section, we took $K = 1.1$.
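The exhaustive search over subsets can be sketched as follows. This is an illustration under stated assumptions: the criterion is taken to be $\|Y - \Pi_m Y\|^2(1 + \mathrm{pen}(|m|)/N_m)$ with $N_m = n - D_m$, the same form as (5.3); the names `residual_ss` and `best_subset` are hypothetical; and the penalty in the usage example is a placeholder rather than the EDkhi-based penalty of the paper.

```python
from itertools import combinations


def residual_ss(y, cols):
    """Return (||y - proj(y)||^2, dim) for the span of cols, by Gram-Schmidt."""
    resid = list(y)
    basis = []
    for col in cols:
        v = list(col)
        for b in basis:
            c = sum(a * e for a, e in zip(v, b))
            v = [a - c * e for a, e in zip(v, b)]
        norm = sum(a * a for a in v) ** 0.5
        if norm < 1e-10:
            continue  # column numerically in the span of the previous ones
        b = [a / norm for a in v]
        basis.append(b)
        c = sum(a * e for a, e in zip(resid, b))
        resid = [a - c * e for a, e in zip(resid, b)]
    return sum(a * a for a in resid), len(basis)


def best_subset(y, columns, p, pen):
    """Minimize the penalized criterion over all nonvoid subsets of size <= p (p < n)."""
    n = len(y)
    best = None
    for size in range(1, p + 1):
        for m in combinations(range(len(columns)), size):
            rss, dim = residual_ss(y, [columns[j] for j in m])
            crit = rss * (1.0 + pen(size) / (n - dim))
            if best is None or crit < best[0]:
                best = (crit, m)
    return best[1]


# Toy usage: three canonical columns in R^4 and a placeholder penalty pen(d) = 2 d.
columns = [[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0]]
y = [10.0, 0.1, 0.1, 0.1]
print(best_subset(y, columns, p=3, pen=lambda d: 2.0 * d))  # (0,)
```

The number of subsets visited is $\sum_{d \le p}\binom{N}{d}$, which is why this approach is only practical for moderate $N$ and $p$.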

For the adaptive Lasso procedure the parameters $\lambda$ and $\gamma$ are estimated using one-fold cross-validation as follows: when $\theta \in \Theta_1$, the values of $\lambda$ vary between 0 and 200 and, following the recommendations given by Zou, $\gamma$ can take three values (0.5, 1, 2). For $\theta \in \Theta_2$, $\lambda$ varies between 0 and 40, and $\gamma$ takes the values (0.5, 1, 1.5), the value $\gamma = 2$ leading to numerical instability in the LARS algorithm.

We evaluate the performances of each procedure by estimating the risk ratio

$$r = \frac{\mathbb E_\theta[\|\mu-\hat\mu_{\hat m}\|^2]}{\inf_{m\in\mathcal M}\mathbb E_\theta[\|\mu-\hat\mu_m\|^2]},$$

the expectation of $|\hat m|$, and calculating the frequencies of choosing and containing the true model $m_0$.
Table 6
Case $\theta \in \Theta_1$. Risk ratio r, expectation of $|\hat m|$ and percentages of the number of times $\hat m$ equals or contains the true model ($m_0 = \{1,2,5\}$). These quantities are averaged over the S design matrices X in $\Theta_1$

               sigma = 1                                sigma = 3
               r      E(|m|)   m = m_0   m contains m_0   r      E(|m|)   m = m_0   m contains m_0
   K = 1.1     1.64   3.44     67%       98.3%            2.89   2.23     12.4%     20.2%
   A. Lasso    1.92   3.73     62%       98.9%            2.58   3.74     13.7%     49.3%



Results. When $\theta \in \Theta_1$, the methods give similar results. Looking carefully at the results shown in Table 6, we remark that the adaptive Lasso method selects more variables than ours. It gives slightly better results when $\sigma = 3$, the risk ratio being smaller and the frequency of containing the true model being greater. But, when $\sigma = 1$, using the adaptive Lasso method increases the risk ratio and wrongly detects a larger number of variables.

In case $\theta \in \Theta_2$, the adaptive Lasso procedure does not work while our procedure gives satisfactory results (see Table 7). The good behavior of our method in this case illustrates the strength of Theorem 2 whose results do not depend on the correlation of the explanatory variables.

Finally, let us emphasize that these methods are comparable neither from a theoretical point of view nor from a practical one. In our method the penalty function is free from $\sigma$, while in the adaptive Lasso method the theoretical results are given for known $\sigma$ and the penalty function depends on $\sigma$ through the parameter $\lambda$. All the difficulty of our method lies in the complexity of the collection $\mathcal M$, which makes it impossible in practice to consider models with a large number of variables.

Table 7
Case $\theta \in \Theta_2$ with $\sigma = 1$. Risk ratio r, expectation of $|\hat m|$ and percentages of the number of times $\hat m$ equals or contains the true model ($m_0 = \{1,2\}$)

               r      E(|m|)   m = m_0   m contains m_0
   K = 1.1     2.35   2.28     80.2%     96.6%
   A. Lasso    26.5   10.2     0.4%      40%

7. Estimating the pair $(\mu,\sigma^2)$. Unlike the previous sections which focused on the estimation of $\mu$, we consider here the problem of estimating the pair $\theta = (\mu,\sigma^2)$. All along, we shall assume that $\mathcal M$ is finite and consider the Kullback loss between $P_{\mu,\sigma^2}$ and $P_{\nu,\tau^2}$ given by

$$\mathcal K(P_{\mu,\sigma^2},P_{\nu,\tau^2}) = \frac n2\Bigl[\log\Bigl(\frac{\tau^2}{\sigma^2}\Bigr) + \frac{\sigma^2}{\tau^2} - 1\Bigr] + \frac{\|\mu-\nu\|^2}{2\tau^2}.$$
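The Kullback loss between two Gaussian distributions with spherical covariances has the closed form $\frac n2[\log(\tau^2/\sigma^2) + \sigma^2/\tau^2 - 1] + \|\mu-\nu\|^2/(2\tau^2)$, which is straightforward to evaluate. The sketch below (hypothetical function name `kullback`) also checks numerically that plugging in $\nu = \mu_m$ and $\tau^2 = \sigma^2 + \|\mu-\mu_m\|^2/n$ gives $\frac n2\log(1 + \|\mu-\mu_m\|^2/(n\sigma^2))$.

```python
import math


def kullback(mu, sigma2, nu, tau2):
    """Kullback divergence K(N(mu, sigma2 I_n), N(nu, tau2 I_n))."""
    n = len(mu)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu, nu))
    return 0.5 * n * (math.log(tau2 / sigma2) + sigma2 / tau2 - 1.0) + sq_dist / (2.0 * tau2)


mu = [1.0, 2.0, 0.0, 0.0]
mu_m = [1.0, 0.0, 0.0, 0.0]            # projection of mu on the first coordinate axis
sigma2 = 1.0
tau2 = sigma2 + 4.0 / len(mu)          # sigma^2 + ||mu - mu_m||^2 / n
print(kullback(mu, sigma2, mu_m, tau2), 0.5 * len(mu) * math.log(1 + 4.0 / (len(mu) * sigma2)))
```

Unlike the Euclidean loss, this loss is not symmetric in its two arguments and it penalizes an underestimated variance much more heavily than an overestimated one.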



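Given some finite collection of models $\mathcal S = \{S_m,\ m \in \mathcal M\}$ we associate to each $m \in \mathcal M$ the estimator $\hat\theta_m$ of $\theta$ defined by

$$\hat\theta_m = (\hat\mu_m,\hat\sigma_m^2) = \Bigl(\Pi_m Y,\ \frac{\|Y-\Pi_m Y\|^2}{N_m}\Bigr).$$

For a given $m$, the risk of $\hat\theta_m$ can be evaluated as follows.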



Proposition 5. Let 
fi m = U m fi. Then, 



inf K(P ,P V , 



(7.1 

and provided that N m > 2 ? 

Tl 

(7.2) Ee[X:(P 9 ,P^)] < IC(P e ,P e J + - 

(7.3) E,[/C(P e ,P, )]>K(P 9 ,P 9m )y 



(fi m ,a m ) where o 2 m = a 2 + - /i m || 2 /n and 
K{Pe,Pe„ 



log ( H k 



D m + 2 



log 1 



£rr, 

n 



In particular, if $D_m \le N_m$ and $N_m > 2$, then

$$\mathcal K(P_\theta,P_{\theta_m}) \vee \frac{D_m}{2} \le \mathbb E[\mathcal K(P_\theta,P_{\hat\theta_m})] \le \mathcal K(P_\theta,P_{\theta_m}) + 4(D_m+2). \tag{7.4}$$

As expected, this proposition shows that the Kullback risk of the estimator $\hat\theta_m$ is of the order of a bias term, namely $\mathcal K(P_\theta,P_{\theta_m})$, plus some variance term which is proportional to $D_m$, at least when $D_m \le (n/2) \wedge (n-3)$. We refer to Baraud, Giraud and Huet (2007) for the proof of these bounds.

Let us now introduce a definition. 

Definition 4. Let $F_{D,N}$ be a Fisher random variable with $D \ge 1$ and $N \ge 3$ degrees of freedom. For $x \ge 0$, we set

$$\operatorname{Fish}[D,N,x] = \frac{\mathbb E[(F_{D,N}-x)_+]}{\mathbb E(F_{D,N})} \le 1.$$

For $0 < q \le 1$ we define $\operatorname{EFish}[D,N,q]$ as the solution to the equation $\operatorname{Fish}[D,N,\operatorname{EFish}[D,N,q]] = q$.

We shall use the convention $\operatorname{EFish}[D,N,q] = 0$ for $q > 1$. Note that the restriction $N \ge 3$ is necessary to ensure that $\mathbb E(F_{D,N}) < \infty$.

Given some penalty pen* from $\mathcal M$ into $\mathbb R_+$, we shall deal with the penalized criterion

$$\mathrm{Crit}'_K(m) = \frac n2\log\Bigl(\frac{\|Y-\Pi_m Y\|^2}{N_m}\Bigr) + \frac12\,\mathrm{pen}^*(m) \tag{7.5}$$

for which our results take a simpler form than with criteria (1.4) and (1.5). In the sequel, we define

$$\hat\theta = \hat\theta_{\hat m} \quad\text{where}\quad \hat m = \arg\min_{m\in\mathcal M}\mathrm{Crit}'_K(m).$$

Theorem 3. Let $\mathcal S = \{S_m,\ m \in \mathcal M\}$, $\alpha = \min\{N_m/n \mid m \in \mathcal M\}$ and $K_1, K_2$ be two numbers satisfying $K_2 \ge K_1 > 1$. If $D_m \le n-5$ for all $m \in \mathcal M$, then the estimator $\hat\theta$ satisfies


(7.6) 



E[JC(P e ,P § )} 



< 



Ki -1 



inf 



E[JC(P e ,P § J} + -(pen*(m)V D m ) 



where 

Si = 2.he l ^ K ^ne- n l^ K ^\Mt t{an \ 



£ (An + 1)A r. 



meM 



and 



A, 



Fish 



D m + 1,N„ 



1. 



N„ 



lK 2 D m + (K 2 -l)pen*(m) 



K x N m K 2 (D m + l) 

In particular, let $\mathcal L = \{L_m,\ m \in \mathcal M\}$ be a sequence of nonnegative weights. If for all $m \in \mathcal M$, $\mathrm{pen}^*(m) = \mathrm{pen}^*_{K_1,K_2,\mathcal L}(m)$ with

$$\mathrm{pen}^*_{K_1,K_2,\mathcal L}(m) = \frac{K_2}{K_2-1}\Bigl[\frac{K_1(D_m+1)N_m}{N_m-1}\operatorname{EFish}(D_m+1,\,N_m-1,\,e^{-L_m}) - D_m\Bigr], \tag{7.7}$$

then the estimator $\hat\theta$ satisfies (7.6) with $\Sigma_2 \le 1.25\,K_1\sum_{m\in\mathcal M}(D_m+1)e^{-L_m}$.

This result is an analogue of Theorem 2 for the Kullback risk. The expression of $\Sigma$ is akin to that of Theorem 2 apart from the additional term of order $n e^{-n/(4K_2^2)}|\mathcal M|^{4/(\alpha n)}$. In most of the applications, the cardinalities $|\mathcal M|$ of the collections are not larger than $e^{Cn}$ for some universal constant $C$, so that this additional term usually remains under control.

An upper bound for the penalty $\mathrm{pen}^*_{K_1,K_2,\mathcal L}$ is given in the following proposition, the proof of which is delayed to Section 10.2.
Proposition 6. Let $m \in \mathcal M$, with $D_m \ge 1$ and $N_m \ge 9$. We set $D = D_m + 1$, $N = N_m - 1$ and

$$\Delta' = \frac{L_m + \log 5 + 1/(N-2)}{1 - 5/(N-2)}.$$

Then, we have the following upper bound on the penalty $\mathrm{pen}^*_{K_1,K_2,\mathcal L}$:

$$\mathrm{pen}^*_{K_1,K_2,\mathcal L}(m) \le \frac{K_1K_2}{K_2-1}\,\frac{N+1}{N-2}\Bigl[1 + e^{2\Delta'/N}\sqrt{\Bigl(1+\frac{2D}{N}\Bigr)\frac{2\Delta'}{D}}\,\Bigr]^2 D. \tag{7.8}$$

8. Proofs of Theorems 2 and 3.



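8.1. Proof of Theorem 2. We write henceforth $\varepsilon_m = \Pi_m\varepsilon$ and $\mu_m = \Pi_m\mu$. Expanding the squared Euclidean loss of the selected estimator $\hat\mu_{\hat m}$ gives

$$\begin{aligned}
\|\mu-\hat\mu_{\hat m}\|^2 &= \|\mu-\mu_{\hat m}\|^2 + \sigma^2\|\varepsilon_{\hat m}\|^2\\
&= \|\mu\|^2 - \|\mu_{\hat m}\|^2 + \sigma^2\|\varepsilon_{\hat m}\|^2\\
&= \|\mu\|^2 - \|\hat\mu_{\hat m}\|^2 + 2\sigma^2\|\varepsilon_{\hat m}\|^2 + 2\sigma\langle\mu_{\hat m},\varepsilon\rangle.
\end{aligned}$$

Let $m^*$ be an arbitrary index in $\mathcal M$. It follows from the definition of $\hat m$ that it also minimizes over $\mathcal M$ the criterion $\mathrm{Crit}(m) = -\|\hat\mu_m\|^2 + \mathrm{pen}(m)\hat\sigma_m^2$ and we derive

$$\begin{aligned}
\|\mu-\hat\mu_{\hat m}\|^2 &\le \|\mu\|^2 - \|\hat\mu_{m^*}\|^2 + \mathrm{pen}(m^*)\hat\sigma_{m^*}^2 - \mathrm{pen}(\hat m)\hat\sigma_{\hat m}^2 + 2\sigma^2\|\varepsilon_{\hat m}\|^2 + 2\sigma\langle\mu_{\hat m},\varepsilon\rangle\\
&\le \|\mu-\mu_{m^*}\|^2 - \sigma^2\|\varepsilon_{m^*}\|^2 - 2\sigma\langle\mu_{m^*},\varepsilon\rangle + \mathrm{pen}(m^*)\hat\sigma_{m^*}^2 - \mathrm{pen}(\hat m)\hat\sigma_{\hat m}^2 + 2\sigma^2\|\varepsilon_{\hat m}\|^2 + 2\sigma\langle\mu_{\hat m},\varepsilon\rangle\\
&\le \|\mu-\mu_{m^*}\|^2 + R(m^*) - \mathrm{pen}(\hat m)\hat\sigma_{\hat m}^2 + 2\sigma^2\|\varepsilon_{\hat m}\|^2 - 2\sigma\langle\mu-\mu_{\hat m},\varepsilon\rangle,
\end{aligned} \tag{8.1}$$

where for all $m \in \mathcal M$,

$$R(m) = -\sigma^2\|\varepsilon_m\|^2 + 2\sigma\langle\mu-\mu_m,\varepsilon\rangle + \mathrm{pen}(m)\hat\sigma_m^2.$$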
For each $m$, we bound $-2\sigma\langle\mu-\mu_m,\varepsilon\rangle$ from above by using the inequality

$$-2\sigma\langle\mu-\mu_m,\varepsilon\rangle \le \frac1K\|\mu-\mu_m\|^2 + K\sigma^2\langle u_m,\varepsilon\rangle^2, \tag{8.2}$$

where $u_m = (\mu-\mu_m)/\|\mu-\mu_m\|$ when $\|\mu-\mu_m\| \ne 0$ and $u_m$ is any unit vector orthogonal to $S_m$ otherwise. Note that in any case, $\langle u_m,\varepsilon\rangle$ is a standard Gaussian random variable independent of $\|\varepsilon_m\|^2$. For each $m$, let $F_m$ be the linear space orthogonal to both $S_m$ and $u_m$. We bound $\hat\sigma_m^2$ from below by the following inequality:

$$\frac{N_m\hat\sigma_m^2}{\sigma^2} \ge \|\Pi_{F_m}\varepsilon\|^2, \tag{8.3}$$

where $\Pi_{F_m}$ denotes the orthogonal projector onto $F_m$.

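By using (8.2), (8.3) and the fact that $2 - 1/K \le K$, inequality (8.1) leads to

$$\begin{aligned}
\frac{K-1}{K}\|\mu-\hat\mu_{\hat m}\|^2 &\le \|\mu-\mu_{m^*}\|^2 + R(m^*) - \mathrm{pen}(\hat m)\hat\sigma_{\hat m}^2 + (2-1/K)\sigma^2\|\varepsilon_{\hat m}\|^2 + K\sigma^2\langle u_{\hat m},\varepsilon\rangle^2\\
&\le \|\mu-\mu_{m^*}\|^2 + R(m^*) + \sum_{m\in\mathcal M}\bigl[K\sigma^2\|\varepsilon_m\|^2 + K\sigma^2\langle u_m,\varepsilon\rangle^2 - \mathrm{pen}(m)\hat\sigma_m^2\bigr]\mathbf 1_{m=\hat m}\\
&\le \|\mu-\mu_{m^*}\|^2 + R(m^*) + \sigma^2\Bigl(KU_{\hat m} - \mathrm{pen}(\hat m)\frac{V_{\hat m}}{N_{\hat m}}\Bigr),
\end{aligned} \tag{8.4}$$

where $U_m = \|\varepsilon_m\|^2 + \langle u_m,\varepsilon\rangle^2$ and $V_m = \|\Pi_{F_m}\varepsilon\|^2$. Note that $U_m$ and $V_m$ are independent and distributed as $\chi^2$ random variables with respective parameters $D_m + 1$ and $N_m - 1$.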

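8.1.1. Case c = 0. We start with the (simple) case $c = 0$. Then, by taking the expectation on both sides of (8.4), we get

$$\begin{aligned}
\frac{K-1}{K}\mathbb E[\|\mu-\hat\mu_{\hat m}\|^2] &\le \|\mu-\mu_{m^*}\|^2 + \mathbb E(R(m^*)) + K\sigma^2\sum_{m\in\mathcal M}\mathbb E\Bigl[\Bigl(U_m - \frac{(N_m-1)\mathrm{pen}(m)}{KN_m}\,\frac{V_m}{N_m-1}\Bigr)_+\Bigr]\\
&\le \|\mu-\mu_{m^*}\|^2 + \mathbb E(R(m^*)) + K\sigma^2\sum_{m\in\mathcal M}(D_m+1)\operatorname{Dkhi}\Bigl(D_m+1,\ N_m-1,\ \frac{(N_m-1)\mathrm{pen}(m)}{KN_m}\Bigr).
\end{aligned}$$

To conclude, we note that

$$\mathbb E(R(m^*)) = -\sigma^2D_{m^*} + \mathrm{pen}(m^*)\Bigl(\sigma^2 + \frac{\|\mu-\mu_{m^*}\|^2}{N_{m^*}}\Bigr)$$

and $m^*$ is arbitrary among $\mathcal M$.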
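8.1.2. Case c > 0. We now turn to the case $c > 0$. We set $\bar V_m = V_m/N_m$ and $a_m = \mathbb E(\bar V_m)$. Analyzing the cases $\bar V_m \le a_m$ and $\bar V_m > a_m$ apart gives

$$\begin{aligned}
KU_m - \mathrm{pen}(m)\bar V_m &= [KU_m - (\mathrm{pen}(m)+c-c)\bar V_m]\mathbf 1_{\bar V_m\le a_m} + [KU_m - (\mathrm{pen}(m)+c-c)\bar V_m]\mathbf 1_{\bar V_m>a_m}\\
&\le ca_m + [KU_m - (\mathrm{pen}(m)+c)\bar V_m]_+\mathbf 1_{\bar V_m\le a_m} + [KU_m - (\mathrm{pen}(m)+c)a_m]_+\mathbf 1_{\bar V_m>a_m}\\
&\le c + [KU_m - (\mathrm{pen}(m)+c)\bar V_m]_+ + [KU_m - (\mathrm{pen}(m)+c)\mathbb E(\bar V_m)]_+,
\end{aligned}$$

where we used for the final steps $a_m = \mathbb E(\bar V_m) \le 1$. Going back to the bound (8.4), we obtain in the case $c > 0$

$$\frac{K-1}{K}\|\mu-\hat\mu_{\hat m}\|^2 \le \|\mu-\mu_{m^*}\|^2 + R(m^*) + c\sigma^2 + \sigma^2\sum_{m\in\mathcal M}[KU_m - (\mathrm{pen}(m)+c)\bar V_m]_+ + \sigma^2\sum_{m\in\mathcal M}[KU_m - (\mathrm{pen}(m)+c)\mathbb E(\bar V_m)]_+. \tag{8.5}$$

Now, the independence of $U_m$ and $V_m$ together with Jensen's inequality ensures that

$$\mathbb E\bigl([KU_m - (\mathrm{pen}(m)+c)\mathbb E(\bar V_m)]_+\bigr) \le \mathbb E\bigl([KU_m - (\mathrm{pen}(m)+c)\bar V_m]_+\bigr),$$

so taking expectation in (8.5) gives

$$\begin{aligned}
\frac{K-1}{K}\mathbb E[\|\mu-\hat\mu_{\hat m}\|^2] &\le \|\mu-\mu_{m^*}\|^2 + \mathbb E(R(m^*)) + c\sigma^2 + 2\sigma^2\sum_{m\in\mathcal M}\mathbb E\Bigl[\Bigl(KU_m - (\mathrm{pen}(m)+c)\frac{V_m}{N_m}\Bigr)_+\Bigr]\\
&\le \|\mu-\mu_{m^*}\|^2 + \mathbb E(R(m^*)) + c\sigma^2 + 2K\sigma^2\sum_{m\in\mathcal M}(D_m+1)\operatorname{Dkhi}\Bigl[D_m+1,\ N_m-1,\ \frac{(N_m-1)(\mathrm{pen}(m)+c)}{KN_m}\Bigr].
\end{aligned}$$

To conclude, we follow the same lines as in the case $c = 0$.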



8.2. Proof of Theorem 3. Let $m$ be arbitrary in $\mathcal M$. In the sequel we write $\mathcal K(m)$ for the Kullback divergence $\mathcal K(P_{\mu,\sigma^2},P_{\hat\mu_m,\hat\sigma_m^2})$, namely

$$\mathcal K(m) = \frac n2\log\Bigl(\frac{\hat\sigma_m^2}{\sigma^2}\Bigr) + \frac{\|\mu-\hat\mu_m\|^2 + n\sigma^2}{2\hat\sigma_m^2} - \frac n2. \tag{8.6}$$

We also set $\varphi(x) = \log(x) + x^{-1} - 1 \ge 0$ for all $x > 0$, $\delta = 1/K_2$, and for each $m$ we define the random variable $\xi_m$ as the number $\langle u_m,\varepsilon\rangle$ with $u_m = (\mu-\mu_m)/\|\mu-\mu_m\|$ when $\|\mu-\mu_m\| \ne 0$ and $u_m$ is any unit vector orthogonal to $S_m$ otherwise.

We split the proof of Theorem 3 into four lemmas.



Lemma 1. The index m satisfies 

(8.7) K \~ l lC{m)<lC{m) + — — pen*(m) + R\(m) + F(rh) + R,2(m, m) 
K\ 2 

where, for all m,m' £ A4, 



2 \&m> °m 

n , x D m cr 2 ||e m || 2 a(fi-fi m ,e) Sn ( a 2 m 

An a 1 v 2 ikJi 2 , ^ 2 e 2 

2 

1-5 



III 



2 

Proof. We have 



-pen [m). 



w M w n . n i 0* , \\H ~ P>m\\ 2 + na 2 ||^-/i m || 2 + no- 2 
K{m) = K{m) + - log ^ + — 2 — - 2 

= JC(m) + ± log % + V ~ ^ f + ^ ~ l|Y " Y ^ Drh 



2 2(T 2 9 



f A n ||/i-/i m || 2 +na 2 -||y-y m | 12 



2 2«r 2 



in 

n i , 2 ll e ™l| 2 + n - ll £ l| 2 Cr(/X-// m ,e) 



= K(m) + _ logt+ ^ ^ 

_ A™ An _ 2||e m || 2 + n - \\e\\ 2 a{n- /i m ,e) 
2 + 2 2^/a 2 + &l ■ 

With £ m defined before the lemma, we get 

Y-(~\<rY-t \ n 1 , 2 ll £ m|| 2 + »- ||g|| 2 , ||M-Mm|| 2 An 

K{m) < /C m + - log + + , 2 — 

, ^%m<o}j| , An _ 2\\e m \\ 2 + n- \\e\\ 2 a(u- u m ,e) 
2a 2 Ja 2 + 2 2a 2 Ja 2 + a 2 m 
In view of (8.6), since $\delta = 1/K_2 \le 1/K_1 < 1$, we have

lb -/An|| 2 _ ICjrh) _ c^lle^H 2 
2Kid% ~ K X 2K x a\ 

< )C(m) a 2 \\e m u2 



K x 2K x a 



3-2 



n 






2Ki 






3) 







and thus, 



ir KM - /c(m) + 2 log 5£ + 1 1 - 2^ J — "2 

, ^{g^oi^A Dm Dm n-\\e\\ 2 

2 ^ 2 + 2 + 2 

_ 0" 2 ||gm|| 2 - /X m ,g) 

a 2 a 2 

" m m 

n , <ri A 1 \ cr 2 ||erf,|| 2 











a 2 


CT 2 




a 2 



< £(m) + (!-£)- log ^ + 1 



al 2 



2 *a? n V JMTJ 5* 

2c2 



+ ^ m - + R 2 (m,m) + i?!(m) 



Finally, we get the result since rh satisfies by definition nlog(a^ l /a m ) < 
pen*(m) — pen*(m). □ 

Lemma 2. For all $m \in \mathcal M$, we have $\mathbb E(R_1(m)) \le D_m/2$.

Proof. Since cp is nonnegative, we have 

D m o" 2 ||e m || 2 a(fj,-fj lm ,e) 8n fa 2 



in. 

2 al ' al 2^\a 2 



Ri(m) = — ^ h 



D m a(n 

-~ + a 2 , • 

Since e and — e have the same distribution, note that 
'((i-Hm,ey 



2E 



(n - D m )E 



+ (n - Z> m )E 



(/JL- fJm,e) 



V ~ A*m|| 2 + Ik - £ m|| 2 + 2(// - /im,e) 

I A* - A^m|| 2 + ||e - £m|| 2 - 2(/z - /i m ,e) 
\2



= (n - An)E 
<0. 



-4(H - fim,e) 



|/U - (Jm\\ + Ik - £m\\ ) ~ 4(/i - Hm,e) 



Consequently, the result follows by taking the expectation on both sides 
of (8.8). □ 

Lemma 3. Under the assumption that $N_m > \alpha n > 5$ for all $m \in \mathcal{M}$, we have, for all $m \in \mathcal{M}$,

E[R 2 (m,m)] < |pen*(m) + 2.hne- {an ~^ s2 l^\M\ A/( - an) . 

Proof. Note that $R_2(m,\hat m) \le R_{2,1}(m,\hat m) + R_{2,2}(m,\hat m)$, where

a 2 a 2 



and 



R 2 ,i(m,m) = -(\\e\\ 2 - (1 - 5)n)J — - 



R2,2{m,rh) = -((1 - 6)n - ||e|| 2 ) + — • 



It remains to bound the expectation of these two terms. 

It follows from the definition of $\hat m$ and the inequality $1 - e^{-u} \le u$, which holds for all $u \ge 0$, that

-2 



a 2 a 2 a 2 



a 



< — (1 — e -pcn*(m)/n-. 



o-i 



< 



and thus, 



E[R 2:1 (m,m)] = 


2 


"(IMI 2 - 


< 




([IMI 2 - 


< 


2^ 


[INI 2 - 


< 


V5 2 + 2/n 




2 


< 


7 

— pen*(m). 



pen* (to) a 2 



(l-8)n)_ 



(l-<5)n]_ 



(- 


--)} 






-) 


pen* (to) 




n 



n 



(i-*H 2 ) 1/2 Efe) 



N„ 



^(N m -2)(N m -4) 



pen* (to) 
As to $\mathbb{E}[R_{2,2}(m,\hat m)]$, we apply Hölder's inequality with $p = \lfloor \alpha n/4 \rfloor + 1$, $q = p/(p-1)$ and have



E[R 2 ,2(m,m)] = ^E 



<™E 
~ 2 



(n(l-6)-\\sf) + ^ 



Sri 



F-^llell 2 ^^!-*) 



<j2p\ 1/p 



<|[P(N| 2 <n(l-S))]^ E (^) 



E4^ 



2p 



2p 
0~ m 



1/p 



and by using that $\mathbb{P}(\|\varepsilon\|^2 \le n(1-\delta)) \le \exp(-n\delta^2/4)$ [see Laurent and Massart (2000), Lemma 1] together with (9.2) (note that $N_{m'} > 2p$ for all $m' \in \mathcal{M}$)



E[R 2t 2(m,m)} 



% 6 



£ 2 £ 



n --«5 2 /(4g) 



E 

•neA-l 

E 



/t (JV m -2)(JV ro -4)..-(JV m -2p) 

g& 

(JV m -2)(JV ro -4)-..(JV m -2p) 



i/p 



1/p 



< 2.5na 2 e- n<52/(49) |/W| 1/p < 2.5ne" (c 



□ 
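The lower-tail bound $\mathbb{P}(\|\varepsilon\|^2 \le n(1-\delta)) \le \exp(-n\delta^2/4)$ quoted from Laurent and Massart (2000) in the proof above can be checked numerically. The sketch below is only an illustration: it restricts to even $n$, where the $\chi^2_n$ distribution function has a closed form, and the function names are hypothetical.

```python
import math

def chi2_cdf_even(n, x):
    """P(chi2_n <= x) for even n, using the closed-form Erlang CDF."""
    if n <= 0 or n % 2 != 0:
        raise ValueError("this closed form needs an even, positive n")
    half = x / 2.0
    term = 1.0      # (x/2)^k / k! for k = 0
    partial = 0.0
    for k in range(n // 2):
        partial += term
        term *= half / (k + 1)
    return 1.0 - math.exp(-half) * partial

def lower_tail_bound(n, delta):
    """Right-hand side exp(-n delta^2 / 4) of the bound quoted in the proof."""
    return math.exp(-n * delta ** 2 / 4.0)

# Compare the exact probability with the bound on a small grid.
for n in (2, 10, 50):
    for delta in (0.1, 0.5, 0.9):
        exact = chi2_cdf_even(n, n * (1.0 - delta))
        print(n, delta, exact, lower_tail_bound(n, delta))
```

On this grid the exact probability stays below the bound, as the cited lemma asserts; the comparison says nothing about how tight the bound is for other values.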



Lemma 4. Under the assumption that $N_m > 5$ for all $m \in \mathcal{M}$, we have

KIT-. 

.9) E[F(m)} < -r 1 E (An + l^lAn + 1, N m - l,q m ] 



with 



(N m - 1) 



K 1 (D m + l)N n 



n K 2 -l v 

^2 



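Proof. Since $\mathbb{E}[F(\hat m)] \le \sum_{m\in\mathcal{M}}\mathbb{E}[F(m)]$, it suffices to bound $\mathbb{E}[F(m)]$ from above for all $m$. As in the proof of Theorem 2, we introduce $U_m = \|\varepsilon_m\|^2 + \xi_m^2$ and $V_m = \|\Pi_{F_m}\varepsilon\|^2 \le N_m\hat\sigma_m^2/\sigma^2$. Since $\delta = 1/K_2$, we get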



F(m) 



1 



1_ 2K7'" em " +— 1 {Cm<0}^ 



<r a 1 



u m z, 



(D m + (l-*)pen*(m)) 
^ K x N m U m If K 2 -l , 
< D m H pen (m 

< K l N m (D m + l) 



2{N m - 1) 

MiVm + 1) N m -1 
V m (D m -l) K 1 {D m + l)N„ 



P6n ^ J 



Since $\frac{U_m(N_m-1)}{V_m(D_m+1)}$ is distributed as a Fisher random variable with $D_m+1$ and $N_m-1$ degrees of freedom, the result follows by taking the expectation on both sides and using $N_m > 5$. □

End of the proof of Theorem 3. By taking the expectation on both sides of (8.7) and using Lemmas 2, 3 and 4 (we recall that $\delta = 1/K_2$) we obtain

Ki - L 



Ki 



-E[£(m)] 

< E[/C(m)] + ^pen*(m) + ^ + ^ne^ an - 4)s2 ^\M\ i/{an) 

+ S (^m' + l)Fish[An' + l,JV TO /-l,g TO /], 

which leads to (7.6) since $m$ is arbitrary in $\mathcal{M}$. Note that the latter series is not larger than $\sum_{m'\in\mathcal{M}}(D_{m'}+1)e^{-L_{m'}}$ for $\mathrm{pen}^*(m') = \mathrm{pen}_{K_1,K_2,\mathcal{L}}(m')$ by definition of EFish.

9. Some preliminary results. The aim of this section is to establish some 
technical results we shall use hereafter. The proofs of these being elementary 
and mainly based on integration by parts, we omit them and rather refer the 
interested reader to the technical report Baraud, Giraud and Huet (2007). 
We start with some moment inequalities on the inverse of a $\chi^2$ random variable.

Lemma 5. Let $V$ be a $\chi^2$ random variable with $N > 2$ degrees of freedom and noncentrality parameter $a$. We have

(9.1) $\frac{1}{a + N - 2} \le \mathbb{E}\left(\frac{1}{V}\right) \le \frac{N}{(N+a)(N-2)} \le \frac{1}{N-2}$.

Let $p$ be some positive integer. If $N > 2p$, then

(9.2) $\mathbb{E}\left(\frac{1}{V^p}\right) \le \frac{1}{(N-2)\cdots(N-2p)}$.

Besides, equality holds in (9.2) for $a = 0$.
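The two-sided bound $1/(a+N-2) \le \mathbb{E}(1/V) \le N/((N+a)(N-2))$ of Lemma 5 can be illustrated numerically. The sketch below is not taken from the paper: it assumes the usual convention $\mathbb{E}(V) = N + a$, under which $V$ is a $\chi^2_{N+2J}$ variable with $J$ Poisson of mean $a/2$, so that $\mathbb{E}(1/V) = \mathbb{E}[1/(N+2J-2)]$. The function names and the truncation level `terms` are hypothetical.

```python
import math

def inv_moment_noncentral_chi2(N, a, terms=400):
    """E[1/V] for a noncentral chi-square V (N > 2 d.f., noncentrality a).

    Assumes the Poisson-mixture representation V ~ chi2_{N+2J}, J ~ Poisson(a/2),
    and E[1/chi2_k] = 1/(k-2); the series is truncated after `terms` terms.
    """
    lam = a / 2.0
    weight = math.exp(-lam)          # P(J = 0)
    total = 0.0
    for j in range(terms):
        total += weight / (N + 2 * j - 2)
        weight *= lam / (j + 1)      # P(J = j+1) from P(J = j)
    return total

def lemma5_bounds(N, a):
    """Lower and upper bounds on E[1/V] as stated in (9.1)."""
    return 1.0 / (a + N - 2), N / ((N + a) * (N - 2.0))

for N, a in [(3, 1.0), (5, 3.0), (10, 20.0)]:
    lo, hi = lemma5_bounds(N, a)
    print(N, a, lo, inv_moment_noncentral_chi2(N, a), hi)
```

For $a = 0$ the series reduces to its first term $1/(N-2)$, in line with the equality case stated for (9.2) with $p = 1$.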
We recall that $\varphi(t) = (t - 1 - \log(t))/2$ for all $t \ge 1$. For two positive integers $D$ and $N$, $F_{D,N}$ denotes a Fisher random variable with $D$ and $N$ degrees of freedom, and we set

(9.3) $B(N/2, D/2) = \int_0^1 t^{N/2-1}(1-t)^{D/2-1}\,dt$,



(9.4) $\psi_{D,N}(t) = \varphi(t) - \frac{D(t-1)^2}{4(D+N+2)}$ for all $t \ge 1$.
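The quantities $\varphi$ and $B$ introduced above are easy to evaluate. The following sketch is an illustration only, with hypothetical function names: it computes $\varphi(t) = (t - 1 - \log t)/2$ and compares the integral $\int_0^1 t^{a-1}(1-t)^{b-1}\,dt$, which is assumed here to be the standard Beta function, with its expression through the Gamma function.

```python
import math

def phi(t):
    """phi(t) = (t - 1 - log t) / 2, as recalled in the text (used for t >= 1)."""
    return (t - 1.0 - math.log(t)) / 2.0

def beta_gamma(a, b):
    """B(a, b) through the identity Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def beta_midpoint(a, b, steps=20000):
    """Midpoint-rule approximation of int_0^1 t^(a-1) (1-t)^(b-1) dt, for a, b >= 1."""
    h = 1.0 / steps
    return h * sum(((i + 0.5) * h) ** (a - 1) * (1.0 - (i + 0.5) * h) ** (b - 1)
                   for i in range(steps))

N, D = 4, 6
print(beta_gamma(N / 2, D / 2), beta_midpoint(N / 2, D / 2))  # both close to 1/12
print(phi(1.0), phi(2.0))                                     # phi vanishes at t = 1
```

The log-Gamma route is the one used when $N$ or $D$ is large, since the Gamma values themselves overflow quickly.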



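The following holds.

Lemma 6. Let $D$ and $N$ be two positive integers. For all $x \ge 0$,

(9.5) $\mathrm{Dkhi}(D,N,x) = \mathbb{P}\left(F_{D+2,N} \ge \frac{x}{D+2}\right) - \frac{x}{D}\,\mathbb{P}\left(F_{D,N+2} \ge \frac{(N+2)x}{DN}\right)$.

If $D \ge 2$ and $x \ge D$, then

$\mathrm{Dkhi}(D,N,x)$

(9.6) $\le \frac{1}{B(N/2, 1+D/2)}\left(\frac{N}{N+x}\right)^{N/2}\left(\frac{x}{N+x}\right)^{D/2}\frac{2(2x+ND)}{N(N+2)x}$

(9.7) $\le \left(1 + \frac{2x}{ND}\right)\mathbb{P}\left(F_{D,N+2} \ge \frac{(N+2)x}{ND}\right)$

(9.8) $\le \left(1 + \frac{2x}{ND}\right)\exp\left[-D\,\psi_{D,N}\left(\frac{(N+2)x}{ND}\right)\right]$.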

The next lemma states similar bounds on $\mathrm{Fish}(D,N,x)$.

Lemma 7. Let $D$ and $N$ be integers fulfilling $D \ge 1$ and $N \ge 3$. Then, for any $x \ge 0$,

(9.9) $\mathrm{Fish}(D,N,x) = \mathbb{P}\left(F_{D+2,N-2} \ge \frac{(N-2)D}{(D+2)N}\,x\right) - x\,\frac{N-2}{N}\,\mathbb{P}(F_{D,N} \ge x)$,

where $F_{D,N}$ is a Fisher random variable with $D$ and $N$ degrees of freedom. Moreover, when $x \ge N/(N-2)$ and $D \ge 2$, we have the upper bounds
Fish(L>,iV,x) 



(9.10) 
(9.11) 



< 



< 



N \ N ' 2 f Dx \ D ' 2 ~ l 2x + N 

N 2 



B(D/2,N/2) \N + DxJ \N + Dx 
l + 2x 



N 



*{F D:N > x) 
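As an illustration of Lemma 7, the case $D = 2$ can be worked out in closed form. The computation below is a sketch that assumes $N \ge 3$ and the definition $\mathrm{Fish}(D,N,x) = \mathbb{E}[(F_{D,N}-x)_+]/\mathbb{E}(F_{D,N})$ (taken here as the definition used earlier in the paper); it uses that $F_{2,N}$ has the explicit tail $(1+2t/N)^{-N/2}$ and mean $N/(N-2)$.

```latex
\begin{align*}
\mathbb{P}(F_{2,N} \ge t) &= \Bigl(1 + \tfrac{2t}{N}\Bigr)^{-N/2}, \qquad t \ge 0,\\
\mathbb{E}\bigl[(F_{2,N} - x)_+\bigr]
  &= \int_x^\infty \Bigl(1 + \tfrac{2t}{N}\Bigr)^{-N/2}\,dt
   = \frac{N}{N-2}\Bigl(1 + \tfrac{2x}{N}\Bigr)^{1-N/2},\\
\mathrm{Fish}(2,N,x)
  &= \frac{\mathbb{E}[(F_{2,N}-x)_+]}{\mathbb{E}[F_{2,N}]}
   = \Bigl(1 + \tfrac{2x}{N}\Bigr)^{1-N/2}
   = \Bigl(1 + \tfrac{2x}{N}\Bigr)\,\mathbb{P}(F_{2,N} \ge x).
\end{align*}
```

The same value is obtained from the identity (9.9) with $D = 2$, which gives a quick consistency check of that formula.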
10. Proofs of propositions and corollaries.

10.1. Proof of Proposition 3. Let $m \in \mathcal{M}$, $D \in \{0,\ldots,n-2\}$, $N = n - D$ and $\mathcal{M}_D = \{m \in \mathcal{M}, D_m = D\}$. For all $c > 0$, (4.3) implies that $\{m' \in \mathcal{M}_D \mid L_{m'} \le c\}$ is finite and since the map $x \mapsto \mathrm{EDkhi}(D+1, N-1, x)$ is decreasing, so is

$\{m' \in \mathcal{M}_D \mid \mathrm{EDkhi}(D+1, N-1, e^{-L_{m'}}) \le c\}$.

It follows from the definitions of $\mathrm{Crit}_{\mathcal{L}}$ and $\mathrm{pen}_{K,\mathcal{L}}$ that for some nonnegative constant $c = c(Y, D, n, m)$,

M D = {m! £ M D \ Criti(m') < Crit L (m)} 

is a subset of $\{m' \in \mathcal{M}_D \mid \mathrm{EDkhi}(D+1, N-1, e^{-L_{m'}}) \le c\}$ and is therefore also finite. We deduce that $\mathrm{Crit}_{\mathcal{L}}$ is minimum for some element of the finite
set SA. = UjD=o-M-D> thus showing that m exists. The remaining part of the 
proposition follows by taking c = in Theorem 2. 

10.2. Proofs of Propositions 4 and 6. Let us start with the proof of Proposition 4. We set

$$b(A,D,N) = \left[1 + e^{2A/(N+2)}\sqrt{\left(1 + \frac{2D}{N+2}\right)\frac{2A}{D}}\right]^2$$

and $x = D\,b(A,D,N) \ge D$.

Since $\mathrm{pen}_{K,\mathcal{L}}(m) = K\frac{N+1}{N}\mathrm{EDkhi}(D,N,e^{-L_m})$, we obtain (4.5) by showing the inequality $\mathrm{EDkhi}(D,N,e^{-L_m}) \le x$ or equivalently

(10.1) $\mathrm{Dkhi}(D,N,x) \le e^{-L_m}$.

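Let us now turn to the proof of (10.1). Since $D \ge 2$ and $x \ge D$, we can apply (9.7) and get

$$\begin{aligned}
\mathrm{Dkhi}(D,N,x) &\le \left(1 + \frac{2x}{ND}\right)\mathbb{P}\left(F_{D,N+2} \ge \frac{(N+2)x}{ND}\right)\\
&\le \left(1 + \frac{2b(A,D,N)}{N}\right)\mathbb{P}\bigl(F_{D,N+2} \ge b(A,D,N)\bigr).
\end{aligned}$$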

The deviation inequality on Fisher random variables available in Baraud, Huet and Laurent (2003) (Lemma 1) gives with $F = F_{D,N+2}$

P(F D>N+2 >b(A,D,N)) 



< 



N + 2) D \ N + 2J D 



N + 2 D V N + 2 2D 



<e~ A 
and hence,

$$\mathrm{Dkhi}(D,N,x) \le \left(1 + \frac{2b(A,D,N)}{N}\right)e^{-A}.$$



By using D > 2 and N > 6, we crudely bound b(A, D, N) from above as 
follows: 



b(A,D,N) 



1 + e 



2A/(7V+2) 



1 + 



2D \ 2A 



N + 2 D 



< 



1 + y/A\l-e 4A / N ] 2 



<(l + A)fl + ^ e 4A ^) 



<|(l + A)e 4A / 7V 



and deduce 



< 



5e 4 Wl 



+ 



1 + A 



N 



Since A(l — 5/iV) = -L m + log5 + 1/iV, inequality (10.1) follows, thus com- 
pleting the proof of (4.5). 

We turn to (4.6). When $D_m = 0$, we obtain (4.6) by showing (10.1) for

$$x = 3\left[1 + e^{2L_m/N}\sqrt{\left(1 + \frac{6}{N}\right)\frac{2L_m}{3}}\right]^2.$$



We deduce from (9.5) that $\mathrm{Dkhi}(1,N,x) \le \mathbb{P}(F_{3,N} \ge x/3)$. Again, the deviation inequality on Fisher random variables gives, with $L = L_m$,

HFs,n > x/S) 

F 3 ,N> 



<r(^> 1+a J( 1+ £)f + ^(l + «)£ 



<]P(i 7 3JV>l + 2W(l + 



3 \L 



N 3 



+ 1 + 



6 \ N, 



N 6 



1] )<e 
leading to (10.1). The proof of Proposition 4 is complete. 

Since the proof of Proposition 6 is similar, we only give the main steps. 
We set 



$$b'(A',D,N) = \left[1 + e^{2A'/N}\sqrt{\left(1 + \frac{2D}{N}\right)\frac{2A'}{D}}\right]^2$$



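and $x' = N b'(A',D,N)/(N-2) \ge N/(N-2)$. In view of (9.11) and Lemma 1 in Baraud, Huet and Laurent (2003), we have

$$\begin{aligned}
\mathrm{Fish}(D,N,x') &\le \left(1 + \frac{2b'(A',D,N)}{N-2}\right)\mathbb{P}\bigl(F_{D,N} \ge b'(A',D,N)\bigr)\\
&\le \left(1 + \frac{2b'(A',D,N)}{N-2}\right)e^{-A'}.
\end{aligned}$$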



Furthermore, when D>2 and N > 8, 

b'{A', D, N)<[1 + e 2A / N J3/2VN} 2 < |(1 + A')e iA '/ N , 
which enforces 

Fish(D,iV,x') < ^l^-^e-^^l^- 2 ^ < e~ L ™. 
As a consequence, 



penK ltK2iC {m) < — — 



K X K 2 D(N + 1) 



< 



2-1 N 

K X K 2 N + l 
Ko-lN-2 



EFish(L>,iV,e" jLm ) < 



£ _ K X K 2 D(N + 1), 



Ko-1 N 



l + e 2AVAT f 1 + 



2D\ 2 A' 
~Wj~D~ 



D. 



10.3. Proof of Corollary 2. We start with an approximation lemma. 

Lemma 8. For all $f \in S^1(q,R)$ and $j \ge 1$ such that $1 \le q \le 2^j - 1$, there exist $m \in \mathcal{M}_{j,q}$ and $g \in \mathcal{F}_m$ such that $\|f - g\|_\infty \le Rq2^{-j}$.

Proof. For $j \ge 1$ and $a \in [0,1]$, we define $a^{(j)} = \inf\{x \in \mathcal{D}_j : x \ge a\}$. For all $x \in [0,1)$, one can write

$$f(x) = f(0) + \int_0^x \sum_{i=1}^{q+1} \alpha_i \mathbf{1}_{[a_{i-1},a_i)}(t)\,dt.$$

We take for $x \in [0,1)$

$$g(x) = f(0) + \int_0^x \sum_{i=1}^{q+1} \alpha_i \mathbf{1}_{[a^{(j)}_{i-1},a^{(j)}_i)}(t)\,dt.$$
Since one may have $a^{(j)}_{i-1} = a^{(j)}_i$ for some indices $i$, the function $g$ belongs to some space $\mathcal{F}_{m'}$ with $m' \in \mathcal{M}_{j,q'}$ and $q' \le q$. By taking (any) $m \in \mathcal{M}_{j,q}$ such that $m' \subset m$, one has $g \in \mathcal{F}_m$.

For each i G {1, . . . , q + 1}, we either have aj_i < afJi <ai< af' or aj_i < 

(7) (7) 
< a[__! 1 = a\ . In any case, we have 



L l — 1' Z L 

and consequently, for all x G [0, 1), 



f(x)-g(x)= I 5^ a *( 1 [«i-i,«i[( < ) ~ 1 [ t ^} 1 ^fi : ^ dt 

l 1' 2 



Li=l 



+ ail [ao,aW) [ (*)-^+l 1 [a 9+1 , a «[(*) 



dt. 



Since $a_0 = a^{(j)}_0 = 0$, $a_{q+1} = a^{(j)}_{q+1} = 1$, and $|a^{(j)}_i - a_i| \le 2^{-j}$, we obtain for $f \in S^1(q,R)$

$$|f(x) - g(x)| \le \sum_{i=1}^{q} |\alpha_{i+1} - \alpha_i|\,2^{-j} \le Rq2^{-j}.$$



□ 
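The rounding step $a \mapsto a^{(j)}$ used in the proof of Lemma 8 can be sketched in a few lines. The code below is an illustration only: it assumes that the grid over which the infimum is taken consists of the dyadic points $k2^{-j}$, $k = 0,\ldots,2^j$ (the grid is not defined in this section), and the function name is hypothetical. Exact rational arithmetic is used so that the grid points are represented without rounding error.

```python
from fractions import Fraction
import math

def round_up_dyadic(a, j):
    """Smallest point k / 2**j (k integer) that is >= a, for a in [0, 1]."""
    a = Fraction(a)
    scale = 2 ** j
    return Fraction(math.ceil(a * scale), scale)

knots = [Fraction(0), Fraction(3, 10), Fraction(1, 2), Fraction(9, 10), Fraction(1)]
for j in (1, 2, 4):
    rounded = [round_up_dyadic(a, j) for a in knots]
    gaps = [r - a for r, a in zip(rounded, knots)]
    print(j, rounded, max(gaps))   # every gap is smaller than 2**(-j)
```

Two distinct knots may be rounded to the same grid point, which is the situation handled in the proof by allowing $q' \le q$ distinct knots for $g$.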



We take $m \in \mathcal{M}_{j,q}$ as in the lemma above with $j$ such that



2 J 



f-i 



2A 



We deduce from Proposition 4 [inequality (4.5)] that when p < (nn — 2) A 
(n — 9) and R < ae K n l q , we have 

2J ^ 36<3r/(l— K)ri f 



pen X £ (m) < C(K, K)q[ — 



<C(K,K,K?)q 



1 + log 1 V 



log 

V q 

nR 2 



qa z 



Besides, 



M-AfmllV-, , P en K,A m )\ . nR 2 q 2 2- 2 i ( pen x>£ ( 



(7 



1 + 



< 



1 + 



iVr, 
pen K Am) 



<q 1 + 



N„ 



< C'{K,K,K')q 

and the result follows from (4.2). 

10.4. Proof of Corollary 3. We write 



1 nR 2 

l+log( IV— j 



n J V {n/2) a / d 

and set m = (r, fci, . . . , fc^) where 

' A \ 1/0(4 



A',; 



»? 



1 d. 



It follows from our choice of rj > ^^y^M) anc ^ * ne assum Ption n > 14 that 

$$(r+1)^d k_1 \cdots k_d \le n/2 \le n - 2.$$

Moreover, under the assumptions n a R 2a+d > R d a 2a and n a / d Ri > 2 a / d R(r + 
l) a we have ki>l for all i. Consequently, m£ M. 

From formula (4.25) in Barron, Birgé and Massart (1999) we know that there exist a constant $C = C(d,r)$ and a piecewise polynomial $P$ in $\mathcal{F}_m$ such that

d 

11/ -PWoo^cY^RiK* 1 <c'v- 

i=l 

Moreover, since the assumptions of Proposition 4 hold, we have 

pen^m) < C(K)L m < C{K){r + l) d R 2d ^ 2a+d \n/a 2 ) d ^ 2a+d \ 

where the second inequality follows from the fact that 7] > ^BlElzl. ^ot/(2a+d) _ 
It remains to apply Theorem 2 to obtain the result. 

Acknowledgment. We are very grateful to Lucien Birgé for his useful